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Abstract 


This  thesis  presents  fast  hypercube  and  shuffle-exchange  algorithms  for  certain  load  bal¬ 
ancing,  selection  and  sorting  problems.  Non-trivial  lower  bounds  are  established  for  load 
balancing  and  selection.  In  addition,  efficient  network  implementations  of  the  parallel  prefix 
operation  and  of  the  elementary  Boolean  matrix  multiplication  algorithm  are  described. 
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Chapter  1 


Introduction 


A  considerable  amount  of  research  effort  in  the  field  of  parallel  computation  has  concentrated 
on  developing  algorithms  for  idealized  machine  models.  The  primary  example  of  this  is  the 
PRAM  model  of  Fortune  and  Wyllie,  which  assumes  the  existence  of  a  shared  memory 
allowing  simultaneous  random  access  by  an  unbounded  number  of  processors  [FW78]. 

This  thesis  adds  to  the  growing  body  of  work  that  addresses  the  design  and  analysis 
of  algorithms  for  more  realistic  models  of  paxallel  computation.  Specifically,  all  of  the  al¬ 
gorithms  to  be  described  are  designed  to  run  on  sparse  interconnection  networks  such  as 
the  hypercube  and  shuffle-exchange.  Algorithms  for  performing  operations  such  as  paral¬ 
lel  prefix,  matrix  multiplication,  load  balancing,  selection  and  sorting  will  be  considered. 
The  primary  motivation  for  developing  fast  implementations  of  these  basic  operations  is  to 
provide  useful  primitives  for  writing  higher-level  parallel  programs. 


1.1  Notation  and  Terminology 

A  p  processor  fixed  interconnection  network  may  be  viewed  as  an  undirected  graph,  where 
vertices  correspond  to  processors  and  edges  correspond  to  bidirectional^  communication 
channels.  Each  processor  has  an  infinite  local  memory,  and  a  unique  integer  ID.  There  is 
no  global  memory;  processors  communicate  with  one  another  by  sending  and  receiving  data 
over  the  channels  provided  by  the  network.  In  order  to  discuss  the  time  complexity  of  an 
algorithm  it  is  necessary  to  define  exactly  what  operations  can  be  performed  in  a  single  unit 
of  time,  or  time  step.  For  establishing  asymptotic  upper  bounds,  it  is  realistic  to  assume 
that: 

^With  respect  to  the  shuffle-exchange,  the  ability  to  “unshuffle”  data  is  assumed. 
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Figure  1.1:  A  hypercube  of  dimension  4  drawn  with  circular  edges. 

1.  Memory  is  configured  in  O(logp)  bit  words. 

2.  In  a  single  time  step,  a  processor  can  send  and/or  receive  a  single  word  of  data  and 
perform  0(1)  CPU  operations  on  word-sized  operands. 

Most  of  the  algorithms  described  in  this  thesis  are  designed  to  run  on  the  hypercube 
and  shuffle-exchange  network  families.  A  dimension  d  hypercube  has  2^  processors  with  IDs 
ranging  from  0  to  2^  —  1.  Processor  i  is  adjacent  to  processor  j  if  and  only  if  the  binary 
representations  of  i  and  j  differ  in  a  single  bit  position.  A  hypercube  of  dimension  4  is 
depicted  in  Figure  1.1. 

The  shuffle-exchange  was  introduced  by  Stone  [Sto71].  Like  the  hypercube,  a  shuffle- 
exchange  of  dimension  d  has  2^  processors  with  IDs  ranging  from  0  to  2^^^.  Processor 
i  =  is  connected  to  processors  Exchange{i)^  Shuffle(i)  and  Unshuffle{i)y  where 

Exchange{i)  -  {id-i  •  *  *  ii{io  ©  1))2, 

Shuffle{i)  =  and 

Unshuffle{i)  =  (ioid^i  •  •  •  ^1)2, 

0  <  i  <  d,  A  shuffle-exchange  of  dimension  4  is  depicted  in  Figure  1.2. 

Some  important  properties  of  the  hypercube  and  shuffle-exchange  network  families  are 
summarized  in  Table  1.1.  Note  that  the  degree  of  the  shuffle-exchange  is  constant,  while 
that  of  the  hypercube  is  unbounded.  Furthermore,  the  optimal  VLSI  layout  area  of  the 
shuffle-exchange  is  somewhat  smaller. 

A  more  powerful  model  of  the  hyper  cube  wiU  also  be  considered,  one  which  does  not 
adhere  to  the  l-port  restriction  on  communication  imposed  above.  This  is  the  pipelined 
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Figure  1.2:  A  shuffle-exchange  of  dimension  4  embedded  on  a  circle. 


Network 

Processors 

Degree 

Diameter 

Layout  Area 

hypercube 

V 

logp 

logp 

0(F) 

shuffle-exchange 

p 

3 

2  logp 

0(p2/log2  p) 

Table  1.1:  Important  properties  of  the  hypercube  and  shuffle-exchange. 


hypercube  model  of  Varman  and  Doshi  [VD88].  The  pipelined  hypercube  remains  a  realistic 
model  of  computation  by  providing  only  a  very  restrictive  form  of  d-port  communication. 
Communication  on  the  pipelined  hypercube  is  via  word-sized  packets,  routed  according  to 
the  following  simple  scheme.  Address  bits  are  successively  corrected  in  either  ascending  or 
descending  (as  determined  by  the  sender)  order  of  significance,  with  no  collisions  permitted. 
A  collision  occurs  when  two  packets  attempt  to  traverse  the  same  edge  in  the  same  direction 
at  the  same  time. 

In  routing  a  packet,  one  time  step  is  expended  for  each  bit  in  the  smallest  contiguous 
block  of  address  bits  that  contains  all  of  the  bits  to  be  corrected.  For  example,  a  packet  sent 
from  processor  IOIIOIOI2  to  processor  IOOIOOII2  must  pass  through  dimensions  1,  2,  and 
5.  Assuming  that  the  sending  processor  elects  to  have  address  bits  corrected  in  descend¬ 
ing  order  of  significance,  the  packet  would  be  routed  according  to  the  schedule  given  by 

the  following  list  of  (time,  processor)  pairs:  (0,101101012),  (1,100101012),  (2,100101012), 
(3,100101012),  (4,100100012),  (5,100100112).  This  packet  is  sent  by  processor  IOIIOIOI2, 
received  and  sent  by  processors  IOOIOIOI2  and  10010001,  and  received  by  IOIIOOII2.  Let 
the  first  sender  (IOIIOIOI2,  in  this  example)  be  called  the  originator  of  the  packet,  and 
let  the  last  receiver  (IOOIOOII2,  in  this  example)  be  called  the  acceptor  of  the  packet.  The 
pipelined  hypercube  imposes  the  following  pair  of  restrictions  on  communication: 
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L  Each  processor  is  allowed  to  originate  and/or  accept  at  most  one  packet  per  time  step. 

2.  Each  edge  can  transmit  at  most  one  packet  in  each  direction  per  time  step. 

The  pipelined  hypercube  model  is  not  realistic  in  a  strict  sense,  since  a  single  RAM 
cannot  hope  to  examine  O(logp)  packets  in  0(1)  time.  However,  it  may  be  a  useful  model 
in  practice  since  the  only  additional  hardware  required  at  each  hypercube  processor  is  a 
ring  of  O(logp)  trivial  coprocessors  to  handle  the  packet  routing  scheme  described  above. 
Viewing  this  as  an  enhancement  to  the  O(logp)  I/O  channel  hardware  already  required  by 
a  hypercube  processor,  one  would  expect  to  suffer  only  a  small  constant  factor  increase  in 
the  VLSI  area  needed  to  implement  a  processor. 

Several  comments  should  be  made  with  regard  to  mathematical  notation.  First,  all 
logarithms  are  to  be  taken  base  2,  that  is,  logo:  denotes  log2a:.  Second,  it  wiU  sometimes 
be  convenient  to  make  use  of  the  function  l0g,  defined  as 

l0ga;  =  max{loga:,  1}. 

Finally,  [a,  6)  will  designate  {i  |  a  <  i  <  6}. 


1.2  Thesis  Organization 

The  following  is  an  overview  of  the  main  results  contained  in  the  thesis. 

Chapter  2,  which  represents  joint  work  with  Ernst  Mayr,  provides  pipelined  paral¬ 
lel  prefix  algorithms  for  the  complete  binary  tree,  hypercube  and  shuffle-exchange.  This 
primitive  is  used  to  develop  a  pipelined  version  of  the  multi-way  merge  sort  of  Nassimi 
and  Sahni  [NS82]  that  runs  on  the  pipelined  hypercube.  Given  p  processors  and  n  < 
plogp  keys  to  be  sorted,  the  running  time  of  the  pipelined  hypercube  sorting  algorithm  is 
0(log^p/log((plogp)/7i)),  which  improves  (asymptotically)  upon  Batcher’s  bitonic  sort  by 
a  log  log  p  factor  in  the  important  case  n  =  p. 

It  has  been  shown  that  the  product  of  two  n  x  n  Boolean  matrices  can  be  computed 
in  O(logn)  time  on  a  hypercube  or  shuffle-exchange  with  0{n^)  processors.  Chapter  3 
reduces  this  processor  requirement  to  0(n^/(log^  n  log  log  n))  by  making  use  of  simulation 
techniques  and  a  parallel  version  of  the  Four  Russians’  algorithm.  This  bound  improves 
upon  a  result  of  Agerwala  and  Lint  by  a  factor  of  log  n  [AL78]. 
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Maintaining  a  balanced  load  is  of  fundamental  importance  on  any  parallel  computer, 
since  a  strongly  imbalanced  load  often  leads  to  low  processor  utilization.  Chapter  4  con¬ 
siders  two  load  balancing  problems.  First,  given  n  tokens  arbitrarily  distributed  over  a  p 
processor  network  with  no  more  than  m  tokens  at  any  one  processor,  how  fast  can  the  tokens 
be  redistributed  so  that  each  processor  holds  the  same  number?  Second,  given  n  tokens 
uniformly  distributed  over  a  p  processor  network,  and  arbitrarily  partitioned  into  g  groups, 
how  fast  can  the  tokens  be  redistributed  so  that  each  processor  holds  the  same  number 
from  each  group?  Upper  bounds  on  the  worst  case  complexity  of  these  two  problems  are 
obtained  from  the  analysis  of  practical  algorithms  for  the  hypercube,  pipelined  hypercube 
and  shuffle-exchange.  Matching  lower  bounds  are  also  provided  for  certain  cases.  Average 
performance  is  also  considered. 

Chapters  5  and  6  are  concerned  with  the  problem  of  selection,  that  is,  determining  the 
A;th  largest  key  out  of  a  given  set  of  n  keys.  Three  different  selection  algorithms  are  given 
in  Chapter  5,  each  of  which  represents  the  best  known  selection  algorithm  over  some  range 
of  the  ratio  n/p  {p  is  the  number  of  processors)  for  one  or  more  of  the  networks  under 
consideration.  One  of  the  algorithms  is  based  on  fast  sorting  of  small  sets,  one  is  based  on 
load  balancing,  and  one  is  based  on  a  sequential  tradeoff  between  preprocessing  and  search 
time  in  a  partial  order.  For  n  >  plog^p,  the  latter  algorithm  runs  in  0((n/p)  log  log p) 
time  on  the  hypercube  and  shuffle  exchange.  Since  the  sequential  complexity  of  selection 
is  linear,  one  might  hope  that  for  njp  sufficiently  large  the  log  log  p  factor  in  this  running 
time  could  be  eliminated.  However,  Chapter  6  proves  that  the  log  log  p  factor  cannot  be 
eliminated.  Specifically,  a  lower  bound  of  f)((n/p)loglogp  -h  logp)  is  established  for  a 
large  class  of  networks  that  includes  the  complete  binary  tree,  multi-dimensional  mesh, 
hypercube,  butterfly  and  shuffle-exchange. 

Chapters  7  and  8  deal  with  the  problem  of  sorting  a  set  of  n  keys  with  p  processors. 
Chapter  7  makes  use  of  the  load  balancing  and  selection  results  of  Chapters  4  and  5  to 
derive  fast,  practical  sorting  algorithms  for  the  hypercube,  shuffle-exchange  and  pipelined 
hypercube.  Chapter  8  considers  two  approaches  to  sorting  when  algorithms  are  confined  to 
a  class  corresponding  essentially  to  sorting  circuits.^  For  sufficiently  large  values  of  the  ratio 
n/p,  all  of  the  sorting  algorithms  described  in  these  two  chapters  have  a  lower  asymptotic 
complexity  than  Batcher’s  bitonic  sort  [Bat68]. 

^The  term  “sorting  circuit”  will  be  used  in  lieu  of  the  more  usual  “sorting  network”  in  order  to  avoid 
confusion  with  interconnection  networks.  For  a  thorough  introduction  to  the  design  and  analysis  of  sorting 
circuits,  see  Knuth  [Knu73]. 


6 


CHAPTER  1.  INTRODUCTION 


Finally,  Chapter  9  offers  some  concluding  remarks  and  open  problems  for  further  con¬ 
sideration. 


Chapter  2 


Pipelining 


This  chapter  combines  several  previously  known  techniques  to  obtain  fast  implementations 
of  the  so-called  parallel  prefix  operation.  Algorithms  are  given  for  the  complete  binary  tree 
as  well  as  the  hypercube  and  shuffle-exchange.  Pipelined  schemes  for  performing  k  prefix 
operations  in  P{k  +  logp)  time  on  p  processors  are  given  for  the  same  set  of  networks. 
Pipelined  parallel  prefix  is  then  used  to  develop  a  simplified  implementation  of  the  optimal 
merging  algorithm  of  Varman  and  Doshi,  which  runs  on  the  pipelined  hypercube  [VD88]. 
Finally,  a  pipelined  version  of  the  multi-way  merge  sort  of  Nassimi  and  Sahni  [NS82],  running 
on  the  pipelined  hypercube,  is  described.  Given  p  processors  and  n  <  plogp  keys  to  be 
sorted,  the  running  time  of  the  pipelined  algorithm  is  O(log^p/log((plogp)/n)).  For  the 
interesting  case  n  =  p  this  yields  a  running  time  of  which  is  asymptotically 

faster  than  Batcher’s  bitonic  sort  [Bat68]. 


2.1  The  Prefix  Operation 

The  prefix  operation  was  introduced  independently  by  Schwartz  [Sch80]  and  by  Ladner 
and  Fischer  [LF80].  For  other  work  on  parallel  prefix,  the  reader  is  referred  to  [Fic83]  and 
[Rei84]. 

Let  ©  denote  a  binary  associative  operator  on  some  domain  X.  Given  {xo, . . . ,  Xn-i}  C 
/T,  the  Prefix  operation  computes  each  of  the  partial  sums  =  xq  ©  •  •  •  ©  0  <  i  <  n.  For 

example,  assuming  that  ©  is  addition,  n  =  5,  xq  =  5,  xi  =  2,  X2  =  6,  X3  =  4  and  X4  =  9, 
then  the  output  of  Prefix  is  yo  =  5,  ?/i  =  7,  y2  =  13,  yz  =  17  and  7/4  =  26. 
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Given  an  additional  n  Boolean  values  ao, .. .  the  n  given  Xi  values  can  be  parti¬ 

tioned  into  contiguous  intervals  in  the  following  manner:  an  interval  begins  at  each  i  such 
that  di  =  true  and  extends  up  to,  but  not  including,  the  next  highest  integer  j  such  that 
Uj  =  true.  The  first  interval  begins  at  processor  0  regardless  of  the  value  of  ao,  and  the  last 
interval  ends  at  processor  n  —  1.  The  segmented  Prefix  operation  executes  a  prefix  operation 
over  each  interval.  Extending  the  example  of  the  preceding  paragraph,  assume  that  a2  and 
a4  are  true  while  ao,  ax  and  as  are  false.  Then  the  Xi  values  are  partitioned  into  the 
intervals  {ro,a:i},  {x2,xs}  and  {X4}  and  the  output  of  the  segmented  Prefix  operation  is 
2/0  =  5,  2/1  =  7,  2/2  =  6,  2/3  =  10  and  2/4  =  9. 

When  implementations  of  the  Prefix  operation  for  various  networks  are  given  in  Sec¬ 
tion  2.2,  it  will  be  convenient  to  assume  that  there  is  an  identity  element  for  0  in  which 
will  be  denoted  O0,  This  assumption  can  be  made  without  loss  of  generality  because  if  no 
such  element  exists,  the  set  X  can  be  augmented  with  an  identity  element  0^  by  defining 
O0  0  a:  =  a:  and  x  0  O0  =  a:  for  all  x  £  X  U  {O0}.  Note  that  associativity  is  preserved. 

Definition  2.1.1  For  all  pairs  of  Boolean  values  ao,ai  and  all  xo,xi  6  X^  let  0'  denote 
the  binary  operator 

(ao,  xo)  0'  (ai,  xi)  =  (ao  or  ai,  if  ax  then  xi  else  xo  0  xi). 

The  operator  0'  will  be  referred  to  as  the  segmented  0  operator. 

Remark  1  The  0'  operator  has  identity  00/  =  (false, O0). 

Remark  2  The  0'  operator  is  not  commutative,  assuming  \X\  >  1. 

Remark  3  The  0'  operator  is  associative. 

Remark  4  For  A:  >  0, 

(ao,  Xo)  0'  •  •  •  0'  (afc,  xjt)  =  (ao  or  •  •  •  or  ajt,  xy  0  •  •  •  0  xjt), 

where  j  is  the  highest  index  less  than  or  equal  to  k  such  that  aj  =  true,  or  0  if  there  is  no 
such  index. 

Remark  1  is  an  immediate  consequence  of  Definition  2.1.1.  For  Remark  2,  let  xo,a;i  be 
distinct  elements  of  X  and  note  that  (true,  xq)  0'  (true,xi)  =  xi  while  (true,xi)  0' 
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(true,xo)  =  a:o.  Remark  3  follows  from  the  observation  that  for  aU  Boolean  values  ao,  «i,  02 
and  X01  xi,X2  £  X 

((ao,a;o)  ©'  (ai,xi))  ©' (a2,X2) 

=  (flo  or  fli,  if  then  xi  else  xo  ©  Xi)  ©'  (02,  X2) 

=  (ao  or  a\  or  02,  if  then  X2  else  if  ai  then  Xi  0  X2  else  xq  ©  Xi  ©  X2) 

=  (ao  or  (ai  or  02),  if  (ai  or  a2)  then  X  else  xq  ©  X) 

=  (ao,xo)©'(ai  or  02,  X) 

=  (ao,xo)e'((ai,Xi)e'(a2,X2)), 

where  X  denotes  the  conditional  expression:  if  02  then  X2  else  xj  ©  X2.  Finally,  Remark  4 
may  be  easily  established  by  induction  on  k. 

Remarks  3  and  4  demonstrate  that  any  segmented  Prefix  operation  with  operator  © 
mapping  A"  x  A’  to  /V  is  equivalent  to  an  ordinary  Prefix  operation  with  operator  ©'  mapping 
(B  X  X)  X  (B  X  X)  to  B  X  X,  where  B  denotes  the  set  of  Boolean  values  {true,  false}.  The 
second  component  of  each  output  pair  is  the  result  of  the  desired  segmented  Prefix  operation, 
and  the  first  component  indicates  whether  or  not  that  processor  belongs  to  an  “undefined” 
interval;  it  is  false  at  processor  i  if  and  only  if  oq,  . . .  ,a,-  are  all  false.  This  reduces  coding 
segmented  prefix  to  coding  ordinary  prefix. 


2.2  Network  Implementations 

This  section  presents  efficient  implementations  of  the  Prefix  operation  for  the  complete 
binary  tree,  hypercube  and  shuffle-exchange  families  of  networks.  It  will  be  assumed  that 
the  network  consists  oi  p  =  n  processors,  and  that  processor  i  initially  contains  the  value 
Xi,  0  <  i  <  p.  The  computation  is  considered  to  be  complete  when  the  partial  sum 
Vi  =  Xo  ©  •  •  •  ©  Xj  has  been  computed  at  processor  i,  0  <  i  <  p.  The  complexity  of  the 
algorithms  will  be  stated  in  terms  of  time  steps,  as  defined  in  Section  1.1.  Unless  otherwise 
stated,  running  times  should  be  assumed  to  be  accurate  to  within  an  additive  constant.  It 
will  be  assumed  that  the  x,  ’s,  as  weU  as  all  partial  sums  of  the  Xj’s,  are  word-sized  quantities. 

In  the  programs  to  foUow,  all  interprocessor  communication  will  be  specified  using  the 
pair  of  routines  Send  and  Receive.  Send  takes  two  arguments:  the  first  specifies  the  word  of 
data  to  be  transmitted,  and  the  second  specifies  the  ID  of  the  destination  processor.  Receive 
is  a  function  with  one  argument  which  specifies  the  ID  of  the  source  processor.  Once  a  packet 
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arrives  from  the  source,  the  word  of  data  contained  in  that  packet  is  returned  as  the  value 
of  the  function.  In  order  to  comprise  a  valid  source/destination  pair,  two  processors  must 
be  adjacent  in  the  network. 

2.2.1  Binary  Tree 

The  first  implementation  of  Prefix  that  will  be  considered  is  the  standard  two-pass  algorithm 
for  the  inorder  complete  binary  tree.  Assume  that  a  binary  tree  of  size  p  =  2^  —  1  is  given, 
with  processors  numbered  inorder  from  0  to  2^  —  2.  An  example  of  such  a  network  is 
shown  in  Figure  2.1,  where  the  processor  IDs  have  been  written  in  binary,  and  d  =  4. 
The  code  for  this  algorithm  assumes  that  each  processor  has  initialized  the  variables  Root^ 
Leaf^  LeftChild^  RightChild  and  Parent  in  the  following  manner.  The  Boolean  variable 
Root  (Leaf)  is  true  if  and  only  if  the  processor  represents  the  root  (a  leaf)  of  the  tree. 
The  integer  variables  LeftChild^  RightChild  and  Parent  hold  the  IDs  of  the  neighboring 
processors,  and  are  undefined  whenever  such  a  neighbor  does  not  exist. 

begin  Prefix(©,  x) 

(1)  XL  —  if  Leaf  then  0^  else  Rece\\/e(LeftChild); 

(2)  xr  ^ —  if  Leaf  then  0®  else  Rece\\/e{RightChild); 

(3)  if  not  Root  then  Send(xL  ©  x  ©  xr,  Parent); 

(4)  i —  if  Root  then  O0  else  Receive(Pareni); 

(5)  VR  < —  2/L  ®  a^L  ©  x; 

(6)  if  not  Leaf  then  Send(7/L?  LeftChild); 

(7)  if  not  Leaf  then  Send(^R,  RightChild); 
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(8)  return(j/R); 
end  Prefix 

As  mentioned  above,  the  program  maJces  two  passes  over  the  tree.  The  first  pass  is 
upward,  from  the  leaves  to  the  root,  and  the  second  pass  is  downward.  For  every  processor 
p,  let  T(p)  denote  the  subtree  rooted  at  processor  p.  Note  that  the  IDs  of  the  processors  in 
r(p)  form  a  contiguous  block  of  integers.  During  the  upward  pass,  each  processor  receives 
the  sum  of  its  left  and  right  subtrees  {xi,  and  xr),  computes  the  sum  over  T(p),  and  passes 
the  result  to  its  parent.  During  the  downward  pass,  each  processor  receives  from  its  parent 
the  sum  pL  over  all  processors  with  IDs  less  than  those  in  r(p),  computes  the  sum  over 
all  processors  with  IDs  less  than  those  in  its  right  subtree  (pr),  and  sends  the  appropriate 
values  to  its  left  and  right  children  (pL  and  pr).  The  correctness  of  the  program  is  easily 
established  by  induction  on  the  depth  of  the  tree,  and  it  runs  in  4logp  time  steps. 

Note  that  in  any  given  time  step,  only  two  of  the  levels  of  the  tree  are  active,  implying 
that  the  algorithm  can  be  pipelined  level  by  level.  By  initiating  a  new  prefix  computation 
every  second  time  step,  it  is  possible  to  perform  k  Prefix  operations  on  the  inorder  complete 
binary  tree  in  2k  +  41ogp  time  steps. 

2.2.2  Hypercube 

For  the  hypercube,  the  following  FFT-like  computation  executes  Prefix  in  logp  time  steps: 

begin  Prefix(©,  x) 

(1)  y  i —  x; 

(2)  for  i  < —  0  to  d  -  1  do 

(3)  Send(j/,  i); 

(4)  if  Myidi  =  0  then 

(5)  y  < —  y  0  Receive(i); 

(6)  else 

(7)  temp  < —  Receive(i); 

(8)  X  < —  temp  0  x; 

(9)  y  < —  temp  0  y; 

(10)  end  if 

(11)  end  for 

(12)  return(x); 
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end  Prefix 

The  variable  Myld  holds  the  ID  of  the  processor,  and  Myld^  denotes  the  ith  bit  of  the 
ID  (the  least  significant  bit  is  bit  0).  The  source  and  destination  arguments  of  Send  and 
Receive  specify  the  bit  position  in  which  the  two  communicating  processors  differ. 

The  program  runs  in  logp  time  steps,  and  functions  in  the  following  manner.  In  ad¬ 
dition  to  the  partial  sums  demanded  by  the  Prefix  operation,  the  total  sum  is  computed 
at  every  processor.  The  local  variables  x  and  y  accumulate  the  partial  and  total  sums, 
respectively.  For  a  hypercube  consisting  of  a  single  processor,  the  computation  is  trivial. 
Given  p  processors  with  associated  Xi  values  and  where  p  =  2^,  d  >  1,  the  program  first 
recursively  computes  partial  and  total  sums  for  the  upper  and  lower  halves  of  the  values 
independently,  and  then  exchanges  the  total  sums  between  halves.  This  enables  the  revised 
partial  sums  for  the  upper  half  to  be  computed,  as  well  as  the  new  total  sums. 

Unfortunately,  the  above  program  does  not  lead  to  a  pipelined  implementation  of  the 
Prefix  operation  because  it  uses  all  of  the  processors  at  every  time  step.  One  way  of  achiev¬ 
ing  pipelined  speedup  is  to  make  use  of  the  dilation  2  inorder  complete  binary  tree  embed¬ 
ding  [BCLR86].  Figure  2.2  gives  this  embedding  for  the  case  p  =  16,  where  the  “extra” 
processor  (with  ID  p  —  1)  has  been  added  as  an  extra  level  above  the  root.  The  edges 
depicted  in  Figure  2.2  are  physical  hypercube  edges.  The  left  child  of  a  non-leaf  processor 
is  connected  directly  to  its  parent,  while  the  right  child  is  connected  to  its  parent  via  the 
left  child.  It  is  easy  to  verify  that  the  pipelined  algorithm  given  for  the  inorder  complete 
binary  tree  in  Section  2.2.1  can  be  modified  to  run  in  the  same  time  bound  on  the  dilation 
2  inorder  complete  binary  tree  embedding.  In  particular,  note  that  processor  p  —  1  is  in  an 
appropriate  location  to  receive  the  sum  over  all  of  the  other  processors.  To  summarize,  k 
Prefix  operations  can  be  performed  in  2k  +  41ogp  time  steps  on  the  hypercube. 

2.2.3  Shuffle-Exchange 

The  hypercube  code  given  in  the  preceding  section  for  performing  a  single  Prefix  operation 
can  be  easily  adapted  to  the  shuffle-exchange: 

begin  Prefix(0,  x) 

(1)  y< — x; 


(2)  repeat  d  times 
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(3)  Sencl(i/,  Exchange); 

(4)  if  MyIdQ  =  0  then 

(5)  y  < —  y  ©  Rece\\/e{Exchang€); 

(6)  else 

(7)  temp  < —  Rece\\/e{Exchange); 

(8)  X  i —  temp  ©  x; 

(9)  y  ^ —  temp  ©  y; 

(10)  end  if 

(11)  Send(x,  Unshuffle); 

(12)  X  i —  Rece\\/e{Shuffl€); 

(13)  Send(y,  Unshuffle); 

(14)  y  < —  Rece\ye{Shuffle); 

(15)  end  repeat 

(16)  return(x); 
end  Prefix 

The  above  program  runs  in  31ogp  time  steps.  As  in  the  case  of  the  hypercube,  however, 
a  different  approach  is  needed  in  order  to  obtain  a  pipelined  implementation  of  the  Prefix 
operation.  Unfortunately,  it  is  not  possible  to  embed  the  inorder  complete  binary  tree  in 
the  shuffle-exchange  with  constant  dilation.  Instead,  a  pipelined  implementation  will  be 
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obtained  by  making  use  of  the  dilation  2  complete  binary  tree  embeddings  depicted,  for 
the  case  p  —  16,  in  Figures  2.3  and  2.4.  The  leaves  of  the  tree  in  Figure  2.3  are  the  high- 
numbered  processors  (those  with  IDs  in  the  range  p/2  to  p  —  1),  numbered  inorder.  In 
this  embedding,  the  ID  of  the  left  child  of  an  internal  processor  is  the  shuffle  of  the  ID 
of  its  parent,  and  siblings  communicate  via  the  exchange  connection.  The  embedding  of 
Figure  2.4  is  defined  in  a  similar  fashion,  and  hcis  the  low-numbered  processors  (0  to  p/2—  1) 
at  its  leaves. 

These  embeddings  can  be  used  to  obtain  a  pipelined  implementation  of  k  Prefix  opera¬ 
tions  as  follows.  First,  the  embedding  of  Figure  2.3  is  used  to  compute  the  k  sets  of  partial 
sums  over  the  high-numbered  processors.  This  takes  2k  +  41ogp  time  steps.  Similarly, 
the  embedding  of  Figure  2.4  can  be  used  to  perform  k  prefix  sums  over  the  low-numbered 
processors  in  2k  -f-  41ogp  time  steps.  At  this  point,  all  that  remains  to  be  done  is  to  broad¬ 
cast,  in  a  pipelined  fashion,  the  k  total  sums  over  the  low-numbered  processors  to  the  p/2 
high-numbered  processors,  and  to  add  these  values  to  the  partial  sums  computed  earlier. 
This  last  phase  can  be  performed  in  2k  -f  21ogp  time  steps  using  the  embedding  of  Fig¬ 
ure  2.4.^  Hence,  k  Prefix  operations  can  be  executed  in  6k  +  10  log  p  time  steps  on  the 
shuffle-exchange. 


^Note  that  as  a  side-effect  of  the  prefix  sums  performed  over  the  low-numbered  processors,  the  desired 
sums  are  already  available  at  the  root. 
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2.2.4  A  Useful  Variation 

Section  2.4  will  make  use  of  a  variant  of  the  Prefix  operation,  Prefix',  defined  as  follows. 
Rather  than  computing  xq  ©  •  •  •  0  a:,-  at  processor  i,  0  <  i  <  p,  Prefix'  outputs  O0  at 
processor  0  and  xo®‘  •  ’©a:*-!  at  processor  i,  1  <  i  <  p.  This  is  sometimes  more  convenient, 
particularly  when  the  operator  ©  is  not  invertible.  Note  that  all  of  the  implementations  of 
Prefix  given  above  may  be  trivially  modified  to  implement  Prefix'  within  precisely  the  same 
time  bounds.  For  example,  in  the  complete  binary  tree  program  of  Section  2.2.1,  it  suffices 
to  change  the  return  value  from  j/r  to  j/l  © 


2.3  Data  Distribution 

Consider  the  binary  associative  operator  ©  defined  over  /T  by  a:  ©  ?/  =  a;,  for  all  x,  t/  €  A'. 
This  is  sometimes  referred  to  as  the  Copy  operator.  Observe  that  the  effect  of  applying 
Prefix  with  the  Copy  operator  is  to  perform  a  broadcast  of  a  single  value  from  processor  0  to 
aU  other  processors.  Of  course,  there  are  simpler  techniques  for  broadcasting  a  single  value 
over  the  processors  of  any  of  the  networks  under  consideration.  However,  combining  this 
observation  with  the  results  of  the  previous  section  immediately  implies  that  k  segmented 
broadcasts  can  be  executed  in  2k  +  41ogp  time  steps  on  the  tree  or  hypercube,  and  in 
6Ar  +  10  log  p  time  steps  on  the  shuffle-exchange. 

In  order  to  fully  illustrate  the  techniques  discussed  in  Section  2.1,  the  implementation 
of  segmented  Prefix  with  the  Copy  operation  will  now  be  studied  in  greater  detail.  As  stated 
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in  Section  2.1,  processor  i  initially  holds  the  Boolean  value  ai  and  Xi  €  A',  0  <  i  <  p.  Note 
that  under  the  Copy  operation  the  only  relevant  Xi^s  are  those  for  which  the  corresponding 
ai  is  true  . 

Clearly,  there  is  no  identity  element  for  the  Copy  operation  in  /V.  To  remedy  this 
situation,  the  domain  of  Copy  is  extended  from  X  to  B  x  X  where  every  pair  with  first 
component  false,  say,  is  defined  to  be  an  identity  element.  In  practice,  this  corresponds  to 
prefixing  a  single  bit  bi  to  each  of  the  x^’s.  Formally,  the  operator  0  =  Copy  becomes 

®  =  {bo  or  6i,  if  6o  then  Xq  else  Xi), 

for  all  609^1  C  B  and  xq^xi  G  X, 

In  order  to  reduce  segmented  Prefix  with  operator  ©  =  Copy  to  ordinary  Prefix  with 
operator  ©'  =  Copy',  let  0'  be  defined  as  follows: 

(ao,(i^o,2:o))0'  (^i,  a:i))  =  (uq  or  ai,  if  ai  then  (61,0:1)  else  (60,0:0)  ©  (61,0:1)). 

Dropping  the  inner  parentheses  and  simplifying  gives 

(^0,60,0:0)  0  (r^i,6i,o:i)  —  (no  or  ni, 

if  ai  then  61  else  60  or  61, 
if  ai  or  not  60  then  xi  else  xo). 

Note  that  the  above  formulation  allows  bit  pipelining  in  the  sense  described  by  Blel- 
loch  [Ble87].  In  other  words,  as  each  bit  of  the  two  operands  is  received,  the  next  output 
bit  can  be  computed.  This  property  holds  not  only  for  the  Copy  operator,  but  also  for  any 
other  single-pass  operator^  as  defined  by  Blelloch  [Ble87]. 

Finally,  observe  that  the  data  distribution  operation  defined  by  UUman  [U1184]  is  equiv¬ 
alent  to  a  segmented  Prefix  operation  with  the  Copy  operator.  Thus,  the  techniques  outlined 
in  Section  2.2  immediately  lead  to  efficient  pipelined  implementations  of  this  primitive  for 
the  complete  inorder  binary  tree,  hypercube  and  shuffle-exchange  network  families. 


2.4  Sorting  on  the  Pipelined  Hypercube 

Ill  this  section,  a  simplified  implementation  of  the  optimal  merging  algorithm  of  Varman 
and  Doshi  [VD88]  will  be  described,  and  this  will  be  used  to  develop  a  pipelined  version  of 
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the  sorting  algorithm  of  Nassimi  and  Sahni  [NS82]  for  the  pipelined  hypercube.  A  formal 
definition  of  the  Sort  operation,  along  with  a  discussion  of  previous  sorting  results  for  the 
hypercube  and  shuffle-exchange,  may  be  found  in  Chapter  7. 

The  added  power  of  the  pipelined  hypercube  will  only  be  used  for  performing  pipelined 
inverse  concentration  routes.  It  is  interesting  to  note  that  the  pipelined  hypercube  is  not 
needed  in  order  to  perform  pipelined  concentration  routes,  nor  is  it  needed  to  perform  the 
pipelined  inverse  concentration  with  copy  operation  of  Varman  and  Doshi.  Concentration 
and  inverse  concentration  routes  were  defined  by  Nassimi  and  Sahni  [NS82],  and  it  is  easy 
to  show  that  k  such  operations  can  be  performed  in  fc  -h  logp  time  steps  on  the  pipelined 
hypercube.  Furthermore,  there  is  no  hope  of  achieving  this  asymptotic  time  bound  on  the 
1-port  hypercube  since  there  is  a  lower  bound  of  Q.{k  \og^ p)  time  steps  in  this  case.  To 
prove  this  lower  bound,  consider  a  set  of  k  monotone  routes  for  which  the  source  processors 
are  exactly  those  with  strictly  more  O’s  than  I’s  in  their  IDs  and  the  destination  processors 
are  those  with  more  I’s  than  O’s.  In  such  a  case,  D{kp)  packets  must  pass  through  the 
0{p\og^^^^  p)  processors  with  an  equal  number  of  O’s  and  I’s  (or  one  more  0  than  1,  say, 
if  logp  is  odd),  which  implies  a  lower  bound  of  fi(Arlog^/^p)  time  steps  for  performing  k 
monotone  routes.  Since  every  monotone  route  can  be  decomposed  into  a  concentration  route 
followed  by  an  inverse  concentration  route,  and  these  operations  have  equal  complexity,  this 
lower  bound  applies  to  the  pipelined  concentration  and  inverse  concentration  operations  as 
well. 

A  pipelined  algorithm  for  merging  two  sorted  lists  X  and  Y  will  now  be  described.  Both 
X  and  Y  are  of  length  pAr,  and  the  algorithm  runs  on  p  processors.  The  algorithm  is  similar 
to  that  proposed  by  Varman  and  Doshi  [VD88],  but  is  somewhat  simpler.  The  optimal 
merging  algorithm  of  Anderson,  Mayr  and  Warmuth  for  the  EREW  PRAM  also  takes  a 
similar  approach  [AMW88].  For  simplicity,  it  will  be  assumed  that  all  of  the  2pk  input  keys 
are  distinct.  For  both  X  and  V,  the  keys  with  ranks  (numbered  from  0)  in  the  range  ik 
to  (i  -f-  1)A;  —  1  are  initially  stored  at  processor  i,  0  <  i  <  p.  The  two  ordered  sets  of  k 
keys  located  at  processor  i  wiU  be  referred  to  as  Xi  and  Yi,  respectively.  Let  Xi  denote  the 
least  element  of  X,,  and  let  yi  denote  the  greatest  element  of  Yi,  0  <  i  <  p.  Let  X'  and  Y' 
denote  the  set  of  aU  Xi^s  and  p,’s,  respectively.  Let  Z  denote  the  sorted  list  of  length  2pk 
that  results  from  merging  X  and  Y.  Those  elements  of  Z  with  ranks  in  the  range  2ik  to 
2{i  +  l)k  —  1,  denoted  must  be  routed  to  processor  i  by  the  end  of  the  computation, 
0  <  i  <  p,  and  must  be  sorted  locally. 
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The  approach  taken  is  first  to  merge  X'  and  Y',  and  then  to  use  the  resulting  list  to 
guide  the  merging  of  X  and  Y.  Let  Z'  denote  the  sorted  list  of  length  2p  that  results  from 
merging  X'  and  Y\  Let  Zj  denote  the  key  with  rank  j  in  Z',  0  <  j  <  2p.  Let  Zj  denote 
the  set  of  k  keys  associated  with  Zj,  that  is,  either  Zj  =  Xi  for  some  Xi  €  X'  and  Zj  =  X^, 
or  Zj  =  yi  for  some  yi  6  Y'  and  Zj  =  Yi*.  Note  that  if  Zj  6  X'  then  the  rank  of  Zj  in  Z  is 
between  jk  and  {j  +  l)fc  —  1,  inclusive.  The  exact  rank  of  Zj  in  Z  can  be  determined  by 
computing  its  rank  in  the  set  Yi,  where  yi  is  the  least  element  of  Y'  exceeding  Zj.  Similarly, 
if  Zj  e  Y'  then  the  rank  of  Zj  in  Z  is  between  jk  and  {j  +  1)A;  -  1,  and  the  exact  rank  of 
Zj  in  Z  depends  upon  the  set  Xt,  where  xi  is  the  largest  element  of  X'  that  is  less  than  Zj. 
Furthermore,  it  is  easy  to  check  that  the  set  Zj  is  contained  in  the  union  of  the 

set  Xt  corresponding  to  the  largest  Xi  that  is  less  than  Z2j,  and  the  set  Yi  corresponding  to 
the  smallest  yi  that  is  greater  than  Z2j-j-i .  These  observations  lead  to  the  following  pipelined 
merging  algorithm. 

Algorithm  Merge 

1.  Reverse  the  list  Y',  that  is,  route  yi  to  processor  p  —  z  —  1,  0  <  i  <  p.  This  takes  logp 
time  steps. 

2.  Merge  X'  and  Y'  by  simulating  a  bitonic  merge  over  2p  processors.  Record  the  data 
movements  to  facilitate  the  “unmerge”  of  Step  3.  This  takes  2 logp  time  steps. 

3.  Route  the  rank  of  each  key  in  Z'  back  to  the  processor  which  originally  held  that  key. 
This  can  be  done  in  2 logp  time  steps  by  following  the  paths  recorded  in  Step  2  in  the 
reverse  direction. 

4.  Route  each  set  X,  to  the  processor  that  held  Xi  after  Step  2,  0  <  i  <  p.  The  ID 
of  that  processor  can  be  computed  from  the  rank  received  by  processor  i  in  Step  3. 
The  routing  can  be  performed  in  2k  +  2 logp  time  steps  using  a  pipelined  inverse 
concentration.  Route  the  Y’s  in  a  similar  fashion,  for  a  total  cost  of  Ak  +  4 logp  time 
steps. 

5.  Assuming  the  set  X^  was  routed  to  processor  ji  in  the  previous  step,  broadcast  X,*  to 
all  processors  with  IDs  in  the  range  ji  +  1  to  0  <  z  <  p.  This  can  be  done  in 
2k  +  4logp  time  steps  with  a  single  application  of  the  Prefix'  operation,  as  described 
in  Section  2.2. 
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6.  Assuming  the  set  Yi  was  routed  to  processor  ji  in  the  previous  step,  broadcast  Yi  to 
all  processors  with  IDs  in  the  range  ji^i  to  ji  —  1,  0  <  i  <  p.  This  can  be  done  with 
a  single  application  of  a  ‘‘backwards”  version  of  Prefix',  and  takes  2k  +  4logp  time 
steps. 

7.  At  this  point,  processor  j  contains  a  copy  of  largest  Xi  with  Xi  <  Z2j 

and  the  smallest  Yi  with  yi  >  Q  <  j  <  p.  As  observed  above,  the  union  of  these 

sets  contains  the  desired  set  Zj,  and  the  keys  to  be  discarded  (i.e.,  those  not  belonging 
to  Zj)  can  be  determined  by  computing  the  exact  rank  of  either  Z2j  or  Z2jj^i-  These 
sets  can  be  merged,  and  the  rank  computation  performed,  with  0{k)  local  operations. 
The  definition  of  a  time  step  allows  these  local  operations  to  be  interleaved  with  the 
computations  of  Steps  5  and  6  at  no  extra  cost. 

Note  that  only  Step  4  uses  the  power  of  the  pipelined  model.  The  total  running  time  of 
Merge  is  8k  +  17 log p  time  steps.  Now  consider  the  case  in  which  2p  processors  are  available 
to  perform  the  merge,  where  it  is  assumed  that  Xi  is  initially  stored  at  processor  f,  Yi  is 
initially  stored  at  processor  2p  —  i  —  1,  and  Zj  is  to  be  output  at  processor  i,  0  <  f  <  p, 
0  ^  J  <  2p.  In  this  case,  Step  1  is  unnecessary,  and  the  cost  of  each  of  Steps  2,  3  and  4  is 
halved,  while  the  cost  of  the  remaining  steps  is  unchanged.  Thus,  the  total  cost  of  Merge 
with  2p  processors  is  6k  +  121ogp  time  steps.  Note  that  for  k  =  fi(logp),  this  running  time 
is  within  a  constant  factor  of  optimal.  Furthermore,  as  observed  by  Varman  and  Doshi, 
this  optimal  merging  routine  immediately  implies  an  optimal  algorithm  for  sorting  when 
the  number  of  keys  to  be  sorted,  n,  exceeds  the  number  of  processors,  p,  by  a  factor  k  that 
is  fi(logp).  The  idea  is  to  sort  the  set  of  k  keys  at  each  processor  locally,  and  then  to  merge 
sorted  subcubes  repeatedly  until  the  entire  hypercube  has  been  sorted.  At  each  level,  even 
subcubes  are  sorted  in  ascending  order  and  odd  subcubes  are  sorted  in  descending  order. 
The  running  time  of  this  algorithm,  which  will  be  referred  to  as  MergeSort,  is 

{6k  +  12i)  =  6k  logp  +  O(log^  p). 

0<t<logp 

As  mentioned  above,  this  running  time  is  optimal  for  k  =  12(logp). 

A  pipelined  version  of  the  multi-way  merging  procedure  of  Neissimi  and  Sahni  [NS82] 
that  runs  on  the  pipelined  hypercube  will  now  be  described.  The  input  consists  of  2^  sorted 
lists  of  length  k2”^,  and  the  output  is  a  single  sorted  list  of  length  A:2^ The  merging  is 
performed  in  0{k  +  logp)  time  steps  on  a  hypercube  with  p  =  processors.  Let  the  ith 


20 


CHAPTER  2.  PIPELINING 


input  list  be  denoted  0  <  i  <  2^  and  let  the  set  of  k  elements  of  with  ranks  between 
jk  and  (j  +  1)A;  -  1  (inclusive)  be  denoted  Xj-,  0  <  j  <  2^.  The  set  Xj  is  initially  stored  at 
processor  i2”^  +  j.  Let  the  output  list  be  denoted  X.  At  the  end  of  the  merging  process, 
the  elements  of  X  with  ranks  between  jk  and  {j  +  l)k  —  1  (inclusive)  should  be  stored  at 
processor  j,  0  <  jf  < 


It  is  useful  to  view  the  processors  of  the  given  hypercube  as  forming  a  2^  by  2^"^^  array, 
where  the  processor  in  row  i  and  column  j  has  ID  +  j  (row-major  order).  Note  that 

aU  of  the  Xj’s  are  stored  in  row  0.  In  fact,  each  processor  in  row  0  contains  exactly  one  set 


The  algorithm  makes  use  of  pipelined  broadcast  and  sum  operations  over  entire  sub¬ 
cubes.  Formally,  a  pipelined  broadcast  operation  takes  k  keys  stored  at  a  single  processor 
and  broadcasts  them  over  the  entire  subcube.  For  a  pipelined  sum  operation,  processor  i 
initially  holds  k  keys  aij,  0<i<p,  Q<j<k.  The  output  is  the  k  sums  J2o<i<p^ijy 
0  <  J  <  A;,  all  of  which  are  output  at  a  single  designated  processor.  Although  such  oper¬ 
ations  can  be  performed  using  Prefix,  other  implementations  exist  which  are  more  efficient 
by  a  constant  factor.  For  example,  using  the  multiple  spanning  binomial  tree  (MSBT) 
embedding  of  Ho  and  Johnsson  [HJ86]  it  is  possible  to  perform  k  broadcasts  in  A;  -|-  logp 
time  steps.  Similarly,  k  sums  can  be  performed  in  k  +  logp  time  steps.  Note  that  although 
these  operations  are  pipelined,  they  run  on  the  1-port  hypercube  and  thus  do  not  require 
the  additional  power  of  the  pipelined  hypercube. 


Algorit  hm  M  u  It  i Way  M  erge 

1.  Broadcast  Xj  to  all  of  the  processors  in  column  z2”^  +  j,  0  <  i  <  2^  0  <  j  <  2”^.  Each 
of  the  columns  is  an  independent  subcube  of  dimension  /.  Thus,  the  broadcasts  can 
be  performed  in  A:  -|-  /  time  steps  using  an  MSBT  embedding  within  each  column. 

2.  Replicate  list  X*  across  the  ith  row,  0  <  i  <  2^  In  other  words,  route  a  copy  of 
Xj  to  each  column  of  the  ith  row  that  is  congruent  to  j  mod  2^.  This  amounts  to 
performing  pipelined  broadcasts  over  subcubes  of  dimension  /,  which  can  be  done  in 
k  +  I  time  steps  using  the  MSBT  embedding. 

3.  Merge  the  lists  X*  and  X^  using  the  jth  block  of  2”^  processors  of  row  i  (i.e,,  columns 
j2'^  to  {j  +  1)2^^  -  1)?  0  <  i,j  <  2^  i  /  J.  This  takes  8A:  +  17m  time  steps. 
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4.  In  the  jth  block  of  2'^  processors  of  row  i,  “unmerge”  the  rank  of  each  element  of 

in  (this  is  the  rank  of  that  key  in  X*  U  X^  minus  its  rank  in  X,),  0  <  j  <  2^, 
i  ^  j.  In  other  words,  route  the  rank  of  each  key  back  to  the  processor  that  contained 
the  key  before  Step  3.  This  is  a  pipelined  inverse  concentration,  and  can  be  performed 
ink  +  m  time  steps.  Where  i  =  j,  simply  label  each  key  with  its  rank  in  X\ 

5.  Compute  the  rank  of  every  key  in  X.  The  processors  of  row  i  are  used  to  perform  this 
computation  for  the  elements  of  the  set  X*,  0  <  z  <  2^  For  each  set  Xj,  a  pipelined 
sum  is  performed  over  a  subcube  of  dimension  /,  adding  the  ranks  computed  in  Step  4 
and  routing  the  results  to  the  first  block  of  2"^  processors  in  each  row.  This  takes  k  +  l 
time  steps  using  the  MSBT  embedding. 

6.  In  row  i,  route  the  elements  of  Xi  to  the  correct  output  column  (given  by  the  floor 
of  the  rank  computed  in  Step  5  divided  by  A:),  0  <  i  <  2^  This  is  a  pipelined  inverse 
concentration  in  a  subcube  of  dimension  /  4-  m,  and  takes  k  +  I  +  m  time  steps. 

7.  Each  column  of  the  array  now  contains  k  keys.  Route  these  keys  to  the  top  of  the 
column  (row  0).  In  terms  of  data  paths,  this  is  essentially  an  inverse  pipelined  broad¬ 
cast  operation  over  a  subcube  of  dimension  /,  and  it  can  be  performed  in  k  +  I  time 
steps  using  the  MSBT  embedding. 

Only  Steps  3,  4  and  6  require  the  power  of  the  pipelined  hypercube.  Summing  all 
of  the  costs  stated  above,  the  total  running  time  of  MultiWay Merge  is  readily  seen  to  be 
14A:  +  5/  +  19m  time  steps. 

Repeated  application  of  MultiWayMerge  on  successively  larger  subcubes  leads  to  a  fast 
sorting  algorithm  for  the  case  n  <  plogp.  The  running  time  of  this  algorithm,  which  wiU 
be  referred  to  as  MultiWay MergeSort,  will  be  shown  to  be  O(log^p/log((j?logp)/n))),  as 
opposed  to  0{log^  p/ log{p/n))  for  the  sorting  algorithm  of  Nassimi  and  Sahni.  For  the 
interesting  case  n  =  p^  the  running  time  of  MultiWayMergeSort  is  O(log^p/loglogp),  a 
slight  asymptotic  improvement  over  that  of  Batcher’s  bitonic  sort.  It  must  be  emphasized, 
however,  that  MultiWayMergeSort  only  runs  on  the  pipelined  hypercube. 

A  more  formal  description  of  the  MultiWayMergeSort  algorithm  wiU  now  be  given,  along 
with  an  analysis  of  its  time  complexity.  The  algorithm  is  designed  to  sort  n  =  k2^  keys  on 
a  hypercube  with  p  =  2^“*"^  processors.  It  is  useful  to  view  the  processors  as  being  arranged 
in  a  2^  by  2^  array,  where  the  processor  in  row  i  and  column  j  has  ID  2*2’^  +  j  (row-major 
order). 
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Algorithm  MultiWayMergeSort 

1.  Each  column  of  the  array  contains  k  keys.  Route  all  of  these  to  the  top  of  the  column 
(row  0).  As  in  Step  7  of  MultiWayMerge,  this  takes  k  +  I  time  steps. 

2.  At  every  processor  in  row  0,  sort  the  set  of  k  keys  using  an  efficient  sequential  sorting 
routine.  This  takes  0{klogk)  time  steps. 

3.  Repeatedly  call  MultiWayMerge.  The  length  of  the  sorted  lists  increases  by  a  factor  of 
2^  after  each  call.  Thus,  after  \m/!\  iterations  all  of  the  keys  have  been  sorted.  The 
cost  of  the  ith  iteration  is  14k  +  5/  +  19i/  time  steps,  for  a  total  cost  of  approximately 
(14/:  +  4/  +  12m)m/l  time  steps. 

4.  The  keys  have  been  sorted,  but  they  are  not  configured  appropriately  (i.e.,  all  of  the 
keys  are  in  row  0).  All  of  the  keys  can  be  routed  to  the  correct  output  locations  using 
k  pipelined  inverse  concentration  routes,  which  takes  k  +  logp  time  steps. 

Steps  3  and  4  make  use  of  the  power  of  the  pipelined  hypercube.  The  total  running 
time  of  MultiWayMergeSort  is  minimized  (to  within  a  constant  factor)  by  setting  k  =  logp, 
and  for  this  choice  of  k  the  running  time  is  dominated  by  the  cost  of  Step  3.  Observ¬ 
ing  that  /  =  log{pk/n)  and  m  =  logp  —  /  <  logp,  one  finds  that  for  k  =  logp  the  al¬ 
gorithm  runs  in  ^^og^ p/\og{{plogp)/n)  +  O(logp)  time  steps.  For  the  case  n  =  p,  k 
can  be  set  to  logp/ log  logp  in  order  to  reduce  the  dominant  term  in  the  running  time  to 

f  log^P/ log  logp- 


2.5  Summary 

This  chapter  has  described  simple  and  efficient  pipelined  implementations  for  the  Prefix 
operation  on  the  complete  inorder  binary  tree,  hypercube  and  shuffle-exchange  families  of 
networks.  Since  UUman’s  data  distribution  primitive  may  be  viewed  as  a  special  case  of 
the  Prefix  operation,  these  results  immediately  yield  a  pipelined  implementation  for  that 
primitive.  A  variant  of  the  Prefix  operation  was  used  to  obtain  a  simplified  implementation 
of  Varman  and  Doshi’s  optimal  merging  algorithm  for  the  pipelined  model  of  the  hypercube. 

In  order  to  better  cissess  the  practical  speed  of  the  various  algorithms  presented  in  this 
paper,  the  coefficient  on  the  leading  term  of  the  running  time  has  been  computed  in  each 
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case.  It  is  quite  possible  that  one  or  more  of  the  moderately  large  coefficients  in  Section  2.4 
could  be  improved  with  only  minor  modifications  to  the  code. 

It  should  be  mentioned  that  for  permutation  routing,  an  important  special  case  of  the 
sorting  problem,  there  is  a  much  simpler  O(log^p/loglogp)  time  algorithm  for  the  case 
n  =  p  than  MultiWayMergeSort  [Pel].  The  idea  is  to  route  packets  in  a  greedy  fashion  over 
sets  of  log  log  p  dimensions  at  a  time.  Each  set  of  routings  produces  a  load  balancing  problem 
in  which  there  may  be  as  many  as  logp  packets  at  any  one  processor,  and  the  objective 
is  to  redistribute  the  packets  so  that  there  is  exactly  one  at  each  processor.  Section  4.1.1 
demonstrates  how  this  redistribution  can  be  performed  in  O(logp)  time  on  the  pipelined 
hypercube  by  making  use  of  the  pipelined  prefix,  broadcast  and  concentration  operations 
described  in  this  chapter. 


Chapter  3 


Boolean  Matrix  Multiplication 


This  chapter  considers  processor-efficient,  optimal  time  implementations  of  the  elemen¬ 
tary  Boolean  matrix  multiplication  algorithm  for  the  hypercube  and  shuffie-exchange.  The 
phrase  “elementary  matrix  multiplication  algorithm”  refers  to  the  standard  O(n^)  time 
sequential  algorithm  for  computing  the  product  of  two  n  X  n  matrices,  as  opposed  to 
asymptotically  faster  (but  in  most  cases  less  practical)  methods  due  to  Coppersmith  and 
Winograd  [CW82][CW87],  Schonhage  [Sch82],  and  Strassen  [Str69][Str86]. 

The  problem  of  implementing  general  matrix  multiplication  on  the  hypercube  and 
shuffle-exchange  was  studied  extensively  by  Dekel,  Nassimi  and  Sahni  [DNS81].  For  the 
special  case  of  Boolean  matrix  multiplication,  Agerwala  and  Lint  have  given  a  paral¬ 
lel  implementation  of  the  four  Russians’  algorithm  which  runs  in  O(logn)  time  using 
O(n^/(lognloglogn))  processors  [AL78].  This  chapter  provides  O(logn)  time  hypercube 
and  shuffle-exchange  implementations  of  Boolean  matrix  multiplication  that  improve  this 
processor  bound  to  (9(n^/(log^  nloglogn)). 


3.1  The  Basic  Algorithm 

Let  n  xn  Boolean  matrices  A  =  (ctij)  and  B  =  (bij)  be  given.  Letting  C  ==  (c^j)  denote  the 
matrix  A5,  the  entries  of  C  are  given  by  the  elementary  formula 

”  \l  ^  b}cj<f  (^*1) 

0<A;<n 
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0  ^  id  <  This  relationship  leads  to  a  simple  and  well-known  0(log7i)  time  matrix 
multiplication  algorithm  running  on  a  hypercube  with  processors.  Some  notation  will 
be  introduced  before  describing  this  algorithm. 

Given  a  hypercube  of  dimension  d,  let  every  string  a  of  length  d  over  the  alphabet 
{0, 1,*}  correspond  to  that  set  of  processors  for  which  the  ID  ‘‘matches”  a  in  the  natural 
sense.  For  example,  in  a  hypercube  of  dimension  4,  the  string  *1*0  corresponds  to  the  4 
processors  0100,  0110,  1100  and  1110.^  It  is  often  convenient  to  specify  such  a  d-bit  string 
as  a  tuple  of  the  form  (t^o  :  ooi  •••5  where  t  and  the  Wi^s  are  nonnegative 

integers,  YloKict'^i  ~  either  *  or  a  Wi-hit  integer.  As  one  might  suspect,  such 

a  tuple  is  intended  to  correspond  to  the  string  /3o-“A-i,  where  /3i  is  the  tUt-bit  string 
corresponding  to  the  binary  representation  of  ai  if  *,  and  *^*  otherwise.  For  example, 
the  tuple  (3  :  6,  4  :  *,  1  :  0)  corresponds  to  the  string  110****0. 

The  basic  O(logn)  algorithm  for  Boolean  matrix  multiplication  on  a  hypercube  with 
processors  can  now  be  easily  described.  Let  x  =  log  n,  and  note  that  each  processor  has  a  3x- 
bit  ID.  Assume  that  input  bits  aij  and  bij  are  initially  stored  in  processor  (a;  :  i,  x  :  j,  x  :  0). 

After  broadcasting  over  ID  bits  [0,x)  (the  rightmost  field),  a  copy  of  aij  resides  in  each 
processor  in  the  set  (x  :  i,  x  :  j,  *).  Hence,  it  certainly  resides  in  the  particular  processor 
(x  :  2,  X  :  j,  X  :  y),  and  by  broadcasting  over  bits  [x,2x),  a  copy  of  a^j  can  be  sent 
to  all  processors  of  the  form  (x  :  i,  x  :  *,  x  :  j).  Similarly,  bij  can  be  routed  to  the 
set  of  processors  (x  :  *,  x  :  x  :  i)  in  0(x)  time.  At  this  point,  note  that  processor 
(x  :  2,  X  :  j,  X  :  k)  contains  aik  and  0  <  <  n.  Thus,  the  Boolean  AND 

operations  of  Equation  (3.1)  can  be  performed  in  a  single  time  step.  The  c^j’s  can  now  be 
computed  by  simply  ORing  over  ID  bits  [0,x).  This  takes  0{x)  time  and  leaves  C{j  in  the 
desired  output  processor  (x  :  2,  x  :  x  :  0).  Thus,  the  entire  algorithm  runs  in  O(logn) 
time,  as  claimed.  Furthermore,  it  can  be  easily  adapted  to  run  on  the  shuffle-exchange  in 
the  same  asymptotic  time  bound. 

3.2  A  Simple  Improvement 

This  section  describes  simple  modifications  to  the  preceding  algorithm  that  reduce  the 
processor  requirement  to  0(n^/ log^  n)  without  affecting  the  asymptotic  time  bound.  Using 

^Note  that  a  string  with  s  occurences  of  the  symbol  ♦  corresponds  to  a  subcube  of  dimension  s  (hence, 
2^  processors). 
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a  more  complicated  scheme,  Section  3.3.1  will  reduce  the  processor  requirement  by  a  further 
factor  of  log  log  n. 

Let  X  =  logn  and  let  y  —  log  log  n.  For  simplicity,  assume  that  y  is  an  integer.  The 
algorithm  of  Section  3.1  will  now  be  modified  to  run  without  (more  than  a  constant  factor) 
slowdown  on  a  hypercube  of  dimension  3x— 2y.  In  the  modified  algorithm,  physical  processor 
p  simulates  the  subcube  (3x  —  2y  :  p,  2y  :  +)  in  a  hypercube  of  dimension  3a:  running 
the  basic  algorithm.  Recall  that  the  basic  algorithm  routes  a  copy  of  aij  to  the  subcube 
{x  :  2,  a:  :  ar  :  j).  In  the  modified  algorithm,  this  computation  is  simulated  as  follows. 

1.  By  broadcasting  over  dimensions  [0,a:  —  y),  a  copy  of  the  bit  aij  can  be  sent  to  each 
processor  in  the  set  (a: :  i,  x  :  j,  x  -2y  :  *).  This  takes  0{x)  time. 

2.  From  the  previous  step,  processor  (ar  :  i,  x  :  j,  x  —  2y  :  j  div  x^)  certainly  holds  a 
copy  of  bit  aij.  By  broadcasting  over  dimensions  [x,2x  —  2y),  this  copy  of  bit  aij  can 
be  sent  to  each  processor  in  the  set  (x  :  i,  x  -  2y  :  2y  :  j  mod  x^,  x  —  2y:j  div  x^). 
This  takes  0(x)  time. 

3.  At  this  point,  every  processor  contains  a  single  entry  from  the  matrix  A.  This  data 

may  be  viewed  as  a  bit  vector  of  length  1.  By  repeatedly  concatenating  the  bit  vector 
held  at  each  processor  with  that  of  its  neighbor  over  dimensions  [x  —  y,x),  every 
processor  in  the  set  (x  :  2,  x  —  y  :  y  :  j  mod  x,  x  —  2y  :  j  div  x^)  ends  up  with  a 

copy  of  the  x  entries  of  A,  packed  into  a  single  (or,  at  least,  some  constant  number 
of)  O(logn)-bit  registers.  This  takes  0(y)  time. 

4.  The  length  of  the  bit  vector  at  each  processor  is  now  x  =  logn,  and  as  the  bit 
vector  held  at  each  processor  is  repeatedly  concatenated  with  that  of  its  neighbor 
over  dimensions  [x  —  2y,  x  -  y),  the  number  of  memory  words  required  to  represent  a 
bit  vector  doubles  at  each  iteration.  Thus,  the  amount  of  time  required  to  complete 
each  successive  iteration  also  doubles,  and  the  total  time  is  proportional  to  the  length 
in  words  of  the  bit  vectors  after  the  last  iteration.  Since  there  are  y  iterations,  the  bit 
vectors  reach  a  length  of  x  words,  and  the  total  time  required  to  perform  this  set  of 
concatenations  is  0{x), 

The  preceding  algorithm  requires  that  each  processor  be  capable  of  concatenating  two 
O(logn)  bit  operands  in  constant  time.  This  could  be  accomplished  by  performing  an 
appropriate  shift  (or  multiply)  operation  followed  by  an  OR.  The  model  of  computation 
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defined  in  Section  1.1  is  not  violated  since  the  operands  never  exceed  O(logn)  bits  in 
length. 

The  end  result  of  applying  the  above  procedure  (which  runs  in  0{x)  time)  is  that  each 
processor  in  the  subcube  5  =  (x  :  x  —  2y  :  j  div  x^)  holds  the  appropriate  set 

of  bits,  namely,  aik  for  those  integers  k  given  by  the  tuple  (x  —  2y  :  j  div  x^,  2y  :  *). 
These  bits  are  stored  in  x  words  of  length  x  and,  assuming  that  the  concatentations  have 
been  performed  in  an  appropriate  manner,  a  copy  of  bit  is  stored  in  bit  position  r  of 
word  s  at  each  processor  in  the  set  5,  where  k  is  the  unique  integer  given  by  the  tuple 
(x  -  2y  :  j  div  x^,  y  :  v,  y  :  s). 

It  wiU  be  convenient  to  specify  certain  sets  of  bit  locations  using  tuple  notation.  In  order 
to  avoid  confusion  between  sets  of  processors  and  sets  of  bit  locations,  square  brackets  will 
be  used  to  denote  bit  locations.  Let  the  bit  location  corresponding  to  position  r  of  word 
s  at  processor  p  be  denoted  [x  —  2y  :  p,  y  :  r,  y  :  s].  Using  this  notation,  it  should  be 
apparent  that  the  sequence  of  operations  described  above  places  a  copy  of  aij  in  the  set  of 
“matrix  A”  bit  locations  given  by  [x  :  i,  x  :  x  :  j],  A  similar  procedure  can  be  used  to 
route  bij  to  the  set  of  “matrix  5”  bit  locations  given  by  [x  :  x  :  y,  x  :  i].  The  AND 
operations  of  Equation  (3.1)  can  now  be  performed  in  x  time  steps  by  ANDing  together 
the  X  corresponding  pairs  of  matrix  A  and  matrix  B  words  at  each  processor.  In  order  to 
compute  the  Ct/s  efficiently,  each  processor  first  locally  ORs  together  the  log^p  bits  that  it 
contains.  This  reduces  the  amount  of  relevant  data  to  a  single  bit  per  processor,  and  takes 
0{x)  time.  The  remaining  OR  operations  are  performed  over  ID  bits  [0,x  —  2y]  as  in  the 
basic  algorithm. 

Unlike  the  basic  algorithm  of  Section  3.1,  implementing  the  algorithm  described  in  this 
section  on  the  shuffle-exchange  so  that  it  runs  in  O(log7i)  time  is  not  entirely  straightfor¬ 
ward.  The  problem  is  that  once  the  data  corresponding  to  array  A,  say,  has  been  replicated 
to  the  point  that  every  processor  contains  O(log^p)  bits,  the  algorithm  cannot  afford  to 
shuffle  the  data  more  than  a  constant  number  of  bit  positions.  Hence,  the  data  must  be 
aligned^  correctly  just  as  the  processors  become  “saturated”.  A  second  requirement  is  that 
the  data  corresponding  to  arrays  A  and  B  be  aligned  in  the  same  way.  Finally,  the  ORing 
will  be  too  expensive  unless  the  shuffle-exchange  is  aligned  in  such  a  way  that  the  “A;  field” 
(the  bit  positions  corresponding  to  k  in  Equation  (3.1))  is  a  constant  number  of  shuffles  away 
from  the  exchange  position.  It  is  not  hard  to  prove  that  these  three  requirements  cannot 

^The  “alignment”  of  the  data  corresponds  to  the  net  number  of  shuffle  operations  that  have  been  applied, 
modulo  logp. 
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be  simultaneously  satisfied  if  the  i,  j  and  k  fields  each  consist  of  a  contiguous  set  of  ID  bits. 
One  solution  to  this  dilemma  is  to  interleave  the  embedding  of  the  i,  j  and  k  fields  among 
the  logp  ID  bits.  For  example,  bit  positions  [0,3®  —  6y)  can  be  alternately  assigned  to  i,  j 
and  k  (in  ascending  order  of  significance),  and  the  bit  positions  [3®  —  6y,3®  —  2y)  can  be 
alternately  assigned  to  the  remaining  bits  of  i  and  j.  Note  that  it  is  actually  not  necessary 
to  interleave  the  entire  address  space  in  order  to  allow  the  three  alignment  requirements  to 
be  simultaneously  satisfied. 


3.3  The  Four  Russians’  Algorithm 

The  elementary  sequential  algorithm  for  multiplying  two  n  x  n  Boolean  matrices  requires 
0{n^)  bit  operations.  Arlazarov,  Dinic,  Kronrod  and  Faradzev  gave  a  practical  algorithm 
that  reduces  this  number  of  bit  operations  to  0(n^ /log  n)  [ADKF70].  A  detailed  descrip¬ 
tion  of  the  so-called  Four  Russians’  algorithm  may  be  found  in  Aho,  Hopcroft  and  Ull- 
man  [AHU74];  the  following  section  will  assume  that  the  reader  is  familiar  with  the  Four 
Russians’  algorithm.  Note  that  under  a  uniform  cost  criterion,  assuming  0(logn)-bit  reg¬ 
isters,  the  Four  Russians’  algorithm  can  be  easily  modified  to  run  in  0(n^/log^  n)  time. 

3.3.1  Parallel  Four  Russians’ 

The  problem  of  parallelizing  the  Four  Russians’  Boolean  matrix  multiplication  algorithm 
was  considered  previously  by  Agerwala  and  Lint  [AL78],  who  exhibited  an  algorithm  that 
runs  in  O(logn)  time  on  a  network  with  O(n^/(log7zloglogn))  processors.  Note  that  the 
simple  algorithm  of  Section  3.2  already  yields  an  improvement  over  the  result  of  Agerwala 
and  Lint.  The  purpose  of  this  section  is  to  establish  that  the  Four  Russians’  approach  can 
be  combined  with  the  techniques  of  Section  3.2  in  order  to  obtain  O(logn)  time  algorithms 
for  the  hypercube  and  shuffle-exchange  using  only 

0  ( — - 

y  log  n  log  log  n 

processors.  The  hypercube  algorithm  will  now  be  presented  in  detail,  followed  by  an  indi¬ 
cation  of  the  modifications  necessary  to  achieve  the  same  asymptotic  performance  on  the 
shuffle-exchange. 

Let  a:,  y  and  z  denote  log  n,  log  log  n  and  log  log  log  n,  respectively.  In  order  to  simplify 
the  exposition,  y  and  2:  will  be  assumed  to  be  integers  and  round-off  errors  will  be  ignored. 
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[x  :  i,  X  :  j,  X  -  z  :  0] 

— )■  [a:  :  i,  x  :  j,  x  —  2y  —  z  :  y  -  z  :  0^  z  :  y  :  0] 

— [x  :  i,  X  —  z  :  j  div  y,  x  ~  2y  :  y  —  ^  :  0,  2:  :  j  mod  y,  y  :  0] 

— >•  [x  :  2,  X  —  2y  :  x  —  z  :  j  div  y,  y  —  2  :  0,  z  :  j  mod  y,  y  :  0] 

— ^  [x  -  y  :  2  div  x,  x  -  y  :  x  -  z  :  j  div  y,  y  -  z  :  0,  z  :  j  mod  y^  y  :  i  mod  x] 

Figure  3.1:  The  path  followed  by  the  a^^’s. 

For  example,  it  will  be  assumed  that  n/y  is  an  integer.  It  is  straightforward  to  verify  that 
the  algorithm  can  be  modified  to  handle  round-off  errors. 

The  task  at  hand  is  to  multiply  two  n  x  n  Boolean  matrices  in  O(log7i)  time  on  a 
hypercube  with  0(n^/(x^y))  processors,  that  is,  a  hypercube  of  dimension  3x  —  2y  —  z.  As 
in  Section  3.1,  it  will  be  assumed  that  input  bits  aij  and  bij  are  initially  stored  at  processor 
(x  :  2,  X  :  y ,  x  —  2y  —  z  :  0),  and  that  output  bit  Cij  should  appear  at  the  same  processor, 

0  <  2,J  <  72. 

As  in  Section  3.2,  each  processor  will  store  up  to  x^  bits  of  data  at  some  point  during 
the  computation.  The  bit  location  corresponding  to  position  r  of  word  s  at  processor  p  will 
be  denoted  [x  —  2y  —  z  :  p,  y  :  r,  y  :  5]. 

As  in  Sections  3.1  and  3.2,  the  first  phase  of  the  algorithm  consists  of  permuting  and 
replicating  the  elements  of  the  two  input  arrays  A  and  B,  Figure  3.1  indicates  the  four  stage 
path  followed  by  the  a^j’s  during  the  first  phase.  Note  that  the  low  order  y  bits  remain  0 
until  the  last  stage,  implying  that  there  is  only  one  word  of  relevant  data  at  each  processor 
during  the  first  three  stages.  Thus,  the  first  three  stages  take  a  total  of  0(x)  time.  The  final 
stage  builds  up  x-word  bit  vectors  at  each  processor,  and  also  takes  0(x)  time.  Figure  3.2 
gives  the  somewhat  more  complicated  seven  stage  path  followed  by  the  6tj’s  during  the  first 
phase.  Once  again,  one  may  verify  that  the  total  cost  of  all  stages  is  0(x). 

It  remains  to  show  how  to  compute  the  c^j’s  in  0(x)  time  given  that  the  arrays  A  and 
B  are  stored  in  the  manner  indicated  by  the  last  tuples  of  Figures  3.1  and  3.2,  respectively. 
At  the  end  of  the  first  phase,  the  processor  given  by  the  tuple  (x  —  y  :  r,  x  —  y  :  s,  x  —  z  :  t) 
holds  the  2xy  array  elements  aik  and  b^j  for  rx  <  i  <  (r  +  l)x,  sx  <  j  <  (s  +  l)x,  and 
ty  <  k  <  {t  +  l)y.  In  the  second  phase  of  the  algorithm,  the  task  of  this  processor  is  to 
perform  the  set  of  AND  and  OR  operations  associated  with  this  set  of  array  elements.  The 
Four  Russians’  technique  is  used  to  perform  these  operations  in  0(x)  time,  as  follows.  Note 
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[x  :  X  :  X  —  z  :  0] 

— ^[x  :i,  X  :  j,  X  -2y  -  z  :  y:*,y:Q] 

— [x  :  z,  X  -  y  :  j  div  x,  x  —  y  -  z  :  y  :  j  mod  x,  y  :  0] 

— [x  —  y  —  z  :i  div  xy,  y  :  z  :  i  mod  y,  x  —  y  :  j  div  x, 

X  —  2y  ~  y  :  {i  mod  xy)  div  y,  y  :  j  mod  x,  y  :  0] 

— ^  [x  —  y  —  z  :i  div  xy,  z  :  i  mod  y,  y  :  x  —  y  :  j  div  x, 

X  —  2y  —  y  :  (i  mod  xy)  div  y,  y  :  j  mod  x,  y  :  0] 

— ^  [x  —  y  —  z  :i  div  xy,  ^  :  i  mod  y,  x  —  y  :  j  div  x, 

X  -  y  -  ^  y  :  (i  mod  xy)  div  y,  y  :  j  mod  x,  y  :  0] 

— ^  [x  —  y  —  z  :  i  mod  y,  x  —  y  :  j  div  x^  x  —  z  :i  div  y,  y  :  j  mod  x,  y  :  0] 

— ^  [x  —  y  :  X  —  y  :  j  div  x,  x  —  z  \  i  div  y,  y  :  jf  mod  x,  y  —  ^  :  0,  z  \  x  mod  y] 

Figure  3.2:  The  path  followed  by  the  6tj’s. 

that  the  s^re  stored  in  y  words  of  x  bits  apiece.  There  are  2^  =  x  possible  words  that 
can  be  obtained  by  ORing  together  a  subset  of  these  y  words.  A  table  T  of  these  x  words  is 
computed,  where  the  /th  entry  in  the  table  (denoted  r(/))  corresponds  to  the  subset  given 
by  the  binary  representation  of  /.  For  example,  if  y  =  4  and  1  =  1  =  OIII2,  then  T{1)  is 
obtained  by  ORing  together  words  0,  1  and  2  (but  not  word  3).  Note  that  the  table  T  can 
be  constructed  in  0{x)  time. 

The  motivation  for  computing  T  is  that  now  the  x  Boolean  values  given  by 

V  ^  ^^  <  ^  <  (^  +  1)^? 

can  be  computed  in  a  single  table  lookup  for  any  fixed  value  of  sx  <  j  <  {s  +  l)x. 
Namely,  one  may  check  that  these  bits  are  given  by  T{u)^  where  u  is  the  y-bit  integer 
Uy^i  •  •  •  uo  with  Uv  —  Note  that  the  first  phase  has  already  constructed  the  word  u 

at  the  appropriate  processor.  Thus,  0(x)  time  suffices  to  perform  all  of  the  AND  and  OR 
operations  local  to  any  particular  processor. 

The  third  phase  of  the  algorithm  consists  of  performing  the  remaining  OR  operations  and 
routing  the  c,j’s  to  the  appropriate  output  processors.  At  the  beginning  of  the  third  phase, 
each  processor  holds  x^  relevant  bits  of  information.  Unlike  the  algorithm  of  Section  3.2,  the 
bits  cannot  be  ORed  together  locally  in  order  to  immediately  reduce  the  amount  of  data  at 
each  procesor  to  a  single  bit.  The  reason  is  that  the  2y  dimensions  being  simulated  within 
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the  processors  correspond  to  the  i  and  j  fields,  rather  than  to  the  k  field.  However,  the 
amount  of  data  per  processor  can  stiU  be  reduced  geometrically  by  ORing  across  appropriate 
physical  dimensions  in  the  k  field.  This  takes  0(a:)  time,  and  once  the  data  has  been  reduced 
to  a  single  bit  per  processor,  the  remainder  of  the  third  phase  can  be  easily  completed  in 
0{x)  time.  Hence,  the  overall  running  time  of  the  algorithm  is  0(a:),  as  claimed. 

As  in  Section  3.2,  it  is  possible  to  adapt  this  algorithm  to  run  in  0{x)  time  on  the 
shuffle-exchange  by  appropriately  interleaving  the  embedding  of  the  i,  j  and  k  fields  among 
the  logp  ID  bits.  The  details  are  left  to  the  reader. 


3.4  Summary 

This  chapter  has  presented  O(logn)  time,  O(ra^/(log^  nloglogra))  processor  implementa- 
tions  of  the  elementary  Boolean  matrix  multiplication  algorithm.  Interleaving  fields  of  ID 
bits  leads  to  efficient  performance  on  the  shuffle-exchange;  this  technique  may  be  more  gen¬ 
erally  useful.  Note  that  the  0(n^/log^  n)  processor  implementation  of  Section  3.2  pipelines 
the  standard  Boolean  matrix  multiplication  algorithm  to  the  maximum  possible  extent.  To 
see  this,  note  that  there  are  0(n^)  bit  operations  to  be  performed,  and  that  each  processor 
can  perform  at  most  O(logn)  bit  operations  per  time  step  (the  register  size),  and  hence  at 
most  O(log^n)  bit  operations  in  O(logn)  time. 

While  the  sequential  Four  Russians’  algorithm  saves  a  factor  of  log  n  time,  the  parallel 
version  described  in  this  chapter  (as  well  as  that  of  Agerwala  and  Lint  [AL78])  reduces 
the  processor  requirement  by  only  a  log  log  n  factor.  The  reason  for  this  is  that  a  parallel 
algorithm  that  runs  in  O(logn)  time  can  only  afford  to  build  tables  of  length  O(logn)  at 
each  processor.  The  tables  constructed  by  the  sequential  Four  Russians’  algorithm  are  of 
length  n,  allowing  logn  bit  computations  to  be  performed  by  a  single  table  lookup. 

The  algorithm  of  Section  3.2  was  coded  up  on  an  NCUBE/ten  parallel  computer  as  a 
sample  application  program  within  a  virtual  processing  environment  [Pia87], 


Chapter  4 


Load  Balancing 


Maintaining  a  balanced  load  is  of  fundamental  importance  on  any  parallel  computer,  since  a 
strongly  imbalanced  load  often  leads  to  low  processor  utilization.  In  this  chapter,  two  load 
balancing  operations  will  be  considered:  Balance  and  MultiBalance.  The  Balance  operation 
corresponds  to  the  token  distribution  problem  considered  by  Peleg  and  Upfal  [PU89]  for 
certain  expander  networks.  The  MultiBalance  operation  balances  several  populations  of 
distinct  token  types  simultaneously. 

These  load  balancing  operations  form  the  basis  of  the  selection  algorithm  given  in  Sec¬ 
tion  5.3,  and  of  the  sorting  algorithms  presented  in  Chapter  7. 


4.1  Problem  Definition:  Balance 

The  first  load  balancing  problem  to  be  considered,  Balance,  is  defined  as  follows.  Let  n 
tokens  be  distributed  over  p  processors,  with  no  more  than  m  tokens  assigned  to  any  single 
processor,  [n/p]  <  m  <  n.  It  wiU  be  assumed  that  n  =  0{p^)  for  some  constant  c  in  order 
that  calculations  involving  token  counts  can  be  performed  with  a  constant  number  of  CPU 
operations.  The  problem  is  to  redistribute  the  tokens  so  that  each  processor  has  either 
[n/pj  or  [n/p]  tokens,  that  is,  so  that  the  load  is  distributed  as  evenly  as  possible.  Peleg 
and  Upfal  have  exhibited  tight  bounds  for  this  operation  on  a  certain  class  of  expander 
networks  [PU89].  In  many  applications,  it  is  not  necessary  to  balance  the  population  of 
tokens  exactly.  If  the  difference  between  the  maximum  number  of  tokens  at  any  processor 
and  the  minimum  number  of  tokens  at  any  processor  is  it  will  be  said  that  the  tokens 
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have  been  balanced  with  error  The  corresponding  operation  will  be  referred  to  as  Balance 
with  error 


4.1.1  Tight  Bounds  for  the  Pipelined  Hypercube 

This  section  describes  an  algorithm  for  the  Balance  operation  with  minimum  error  that 
runs  in  0{mlogp)  time  on  the  hypercube  or  shuffle-exchange,  and  in  0{m  +  logp)  time 
on  the  pipelined  hypercube.  This  algorithm  is  due  to  Tom  Leighton  [Lei].  Assuming  that 
m  exceeds  \n/p]  by  at  least  some  constant  factor,  there  is  a  trivial  fi(m  +  logp)  lower 
bound  for  the  pipelined  hypercube.  The  fi(m)  term  in  the  lower  bound  holds  since  at  least 
m  “  In/p]  =  Q(m)  time  steps  are  necessary  for  a  processor  initially  holding  m  tokens  to 
send  away  sufficiently  many  tokens  to  reach  \n/p]^  the  maximum  allowable  number  in  any 
balanced  configuration.  The  logp  term  in  the  lower  bound  holds  because,  as  will  be  proven 
rigorously  in  Section  4.1.2,  it  is  possible  to  configure  the  tokens  in  such  a  way  that  no 
token  is  placed  within  f2(logp)  hops  of  a  particular  processor.  Thus,  Leighton’s  algorithm 
provides  tight  bounds  for  the  pipelined  hypercube  when  m  exceeds  [n/p]  by  some  constant 
factor.  Note  that  if  m  does  not  exceed  \nlp\  by  a  factor  of  2  (say),  then  load  balancing  is 
probably  not  necessary  anyway. 

The  1-port  version  of  Leighton’s  algorithm  wiU  now  be  described.  The  algorithm  runs 
in  m  phases,  and  each  phase  takes  care  of  one  token  from  every  processor  for  which  the 
supply  of  tokens  has  not  yet  been  exhausted.  In  a  phase,  the  designated  tokens  are  routed 
to  a  contiguous  block  (with  respect  to  processor  ID  modulo  p)  of  processors.  Each  token  is 
routed  in  exactly  one  phase.  The  first  block  begins  at  processor  0  (say),  and  each  subsequent 
block  begins  at  the  processor  following  the  end  of  the  previous  block.  In  this  manner,  the 
population  of  tokens  gets  distributed  as  evenly  as  possible. 

A  single  phase  of  Leighton’s  algorithm  is  implemented  by  performing  a  prefix  sum 
over  the  designated  tokens  followed  by  a  concentration  route  as  defined  by  Nassimi  and 
Sahni  [NS82].  The  prefix  sum  gives  the  offset  of  each  token  within  the  contiguous  block  of 
destination  processors.  The  size  of  the  block  is  broadcast  so  that  all  processors  can  compute 
the  start  of  the  next  block.  All  of  these  operations  can  be  performed  in  O(logp)  time  ,  so 
the  over- all  running  time  of  Leighton’s  algorithm  is  O(mlogp).  Note  that  this  time  bound 
is  valid  for  the  shuffle-exchange  as  well  as  the  hypercube. 

As  indicated  above,  Leighton’s  algorithm  is  best  suited  for  the  pipelined  hypercube. 
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The  prefix  sum  and  broadcast  operations  can  be  pipelined  on  the  hypercube  but  the  con¬ 
centration  routes  cannot  (provably).  On  the  other  hand,  the  pipelined  hypercube  allows 
concentration  routes  to  be  pipelined  [VD88].  Hence,  the  Balance  operation  can  be  imple¬ 
mented  to  run  in  0(m  +  logp)  time  on  the  pipelined  hypercube. 

4.1.2  A  Lower  Bound  for  the  Hypercube 

A  lower  bound  for  the  running  time  of  Balance  on  the  hypercube  can  be  obtained  by  con¬ 
centrating  all  of  the  tokens  in  a  small  number  of  processors  and  then  bounding  the  time 
required  to  eliminate  the  excess  tokens  from  this  set  of  processors.  In  the  following,  recall 
that  d  =  log  p  denotes  the  dimension  of  the  given  hypercube. 

Definition  4.1.1  Let  r’’(i)  denote  the  set  of  (f)  processors  at  Hamming  distance  r  from 
processor  i,  0  <  r  <  d. 

Definition  4.1.2  Let  B{i,r)  denote  the  complete  Hamming  ball  of  radius  r  centered  at 
processor  i.  More  formally,  this  is  the  set  of  processors  given  by  B{i,r)  —  Uo<;<rr^(i). 

Note  that  |5(i,  r)|  =  Ylo<l<r  (;)•  It  will  also  be  necessary  to  consider  “incomplete”  Ham¬ 
ming  balls,  that  is,  Hamming  balls  with  only  a  partially  filled  outer  layer. 

Definition  4.1.3  Given  a  positive  integer  x,  let  r  and  y  be  the  unique  pair  of  nonnegative 
integers  such  that  x  =  y+  X^o<;<r-i  (/)»  I  ^  2/  ^  (r)-  ^  processors  .B  is  a  Hamming 

hall  of  size  x  and  radius  r  if  there  is  some  processor  i  and  some  set  of  processors  S  such 
that  B  =  B{i,r  -  1)  U  5,  5  C  r’’(i)  and  l^l  =  y.  Let  B{x)  denote  the  set  of  aU  Hamming 
balls  of  size  x. 

Definition  4.1.4  Let  a  graph  G  with  vertex  set  V{G)  and  edge  set  E{G)  be  given.  For 
every  U  C  V(G),  the  fringe  of  U  with  respect  to  G,  IF{G,  17),  is  defined  as  the  set 

{ueU\  (ti,u)  6  E{G)  for  some  v  6  V{G)  \  U}. 

Finally,  let  the  function  f{G,  x)  be  defined  as 

fiG,x)  =  ^nnn^  \HG,U)\, 

\U\=x 

where  G  is  a  graph  and  x  is  an  integer,  0  <  x  <  inG)|. 
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Lemma  4.1.1  The  Balance  operation  requires  fi((n  —  [n/m]  {n/p])/ f{G,  fn/m]))  time. 

Proof:  The  n  tokens  can  be  concentrated  in  a  set  S  of  [n/m]  processors.  Under  a 

balanced  load,  S  contains  at  most  fn/m]  \n/p]  tokens.  Hence,  at  least  n  —  fn/m]  fn/p] 
tokens  must  leave  the  set  S.  Only  processors  located  at  the  fringe  of  5,  those  in  T{G,  5), 
can  send  tokens  out  of  5,  and  these  can  only  transmit  one  token  per  time  step.  Therefore, 
the  running  time  of  Balance  must  be  at  least  (n  -  fn/m]  ln/p])/{f{G,  fn/m])).  [] 

Having  established  Lemma  4.1.1,  the  desired  lower  bound  can  now  be  obtained  by  com¬ 
puting  f{Hd,  fn/m]),  where  Hj,  denotes  the  undirected  graph  corresponding  to  a  hypercube 
of  dimension  d.  Intuitively,  one  might  expect  that  the  value  of  f(Hd,x)  is  determined  by  a 
Hamming  ball  configuration.  The  correctness  of  this  intuition  is  borne  out  by  the  following 
theorem  due  to  Harper  [Har66].  Note  that  Frankl  and  Fiiredi  have  given  a  simpler  proof  of 
the  same  result  [FF81]. 

Theorem  4.1.1  For  every  integer  a:,  0  <  a;  <  p,  there  exists  a  ball  B  6  B{x)  such  that 

/(Hd,a:)=|jr(H,,H)l.  D 

The  following  estimate  of  the  “volume- to-surface”  ratio  of  a  Hamming  ball  is  proven  in 
Appendix  A. 

Lemma  4.1.2  For  positive  integers  d  and  r  =  r{d),  0  <  r  <  d/2,  let  S  —  (ff)  and  let 
V  =  XIo<i<r  (f)-  Then  V  —  implies  that  V/S  =  1  <  fc  <  d.  Q 

Theorem  4.1.2  The  Balance  operation  requires  Q.{k^l'^m)  time  on  the  hypercube  if  m  = 
0(p^/^(ra/p))  and  m  >  2n/p. 

Proof:  Note  that  A;  >  1  since  m  <  n,  and  k  <  logp  since  m  >  2n/p.  Theorem  4.1.1  and 
Lemma  4.1.2  together  imply  that 

1  ^  ^  ^  logp*  Now  consider  the  lower  bound  given  by  substituting  this  equation  into  the 
statement  of  Lemma  4.1.1.  The  numerator,  n  —  [n/m]  [n/p]  is  at  least  n  —  \p/2]  \n/p]  = 
fi(n).  The  denominator,  /(G,  [n/m]),  reduces  to  Hence, 

Balance  requires  il(k^^'^(n/p)p^^^)  =  time  on  the  hypercube.  [] 
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4.1.3  Upper  Bounds  for  the  Hypercube 

Now  consider  the  task  of  obtaining  an  efficient  implementation  of  the  Balance  operation  on 
the  hypercube.  Let  a  sub  cube  of  dimension  1  be  given,  with  a  tokens  at  processor  0  and 
b  tokens  at  processor  1.  Further  assume  that  each  processor  knows  the  number  of  tokens 
that  it  is  holding,  that  is,  the  value  a  (b)  is  stored  in  the  local  memory  of  processor  0  (1). 
In  this  case,  it  is  easy  to  see  that  the  Balance  operation  can  be  performed  over  the  given 
subcube  in  \a  —  b\/2  +  0(1)  time.  This  observation  motivates  the  following  definition. 

Definition  4.1.5  Let  a  set  of  n  tokens  be  arbitrarily  distributed  over  the  processors  of  a 
p  processor  hypercube,  p  >  2.  Let  the  Balance  operation  be  applied  to  each  of  the  p/2 
subcubes  of  dimension  1  induced  by  the  set  of  p/2  edges  across  dimension  i.  Then  the 
hypercube  has  been  balanced  across  dimension  i. 

Clearly  the  amount  of  time  required  to  balance  across  dimension  i  is  ^/2  +  0(1),  where 
^  is  the  maximum  discrepancy  between  the  number  of  tokens  at  a  given  processor  and  its 
neighbor  in  dimension  i. 

Lemma  4.1.3  Let  the  low  and  high  subcubes  with  respect  to  dimension  z  of  a  given  hy¬ 
percube  be  balanced  with  error  Then  after  balancing  across  dimension  i,  the  entire 
hypercube  is  balanced  with  error  (^  +  1- 


Proof:  Initially,  each  processor  in  the  low  subcube  contains  a  number  of  tokens  in  the 
range  [a,  a  +  for  some  integer  a.  Similarly,  each  processor  in  the  high  subcube  contains 
a  number  of  tokens  in  the  range  [6,  b  +  ^]  for  some  integer  6.  Thus,  after  balancing  across 
dimension  i  every  processor  contains  a  number  of  tokens  in  the  range  [[^^J  5 
and 


a  4"  b  2^ 


a  b 


+  < 


a  +  b 


+  ^  +  1, 


which  completes  the  proof.  [] 


By  repeated  application  of  Lemma  4.1.3,  one  finds  that  successively  balancing  across 
every  dimension  of  the  hypercube  yields  an  implementation  of  Balance  with  error  logp.  The 
running  time  of  this  algorithm  is  O(mlogp),  since  no  processor  will  ever  contain  more  than 
m  tokens  and  hence  each  balancing  step  requires  at  most  m/2  4-  0(1)  time.  Furthermore, 
in  the  important  case  m  =  0(n/p),  it  is  possible  to  distribute  the  tokens  so  that  this 
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performance  is  achieved  to  within  a  constant  factor.  In  other  words,  the  worst  case  running 
time  of  such  a  balancing  algorithm  is  0(mlogp)  when  m  =  0(n/p), 

The  following  algorithm  improves  on  this  time  bound  by  making  a  more  careful  de¬ 
composition  of  the  hypercube.  One  additional  definition  is  needed  in  order  to  present  the 
algorithm. 

Definition  4.1.6  Let  the  discrepancy  across  dimension  i,  denoted  represent  the  abso¬ 
lute  value  of  the  difference  between  the  total  number  of  tokens  in  the  high  and  low  subcubes 
with  respect  to  dimension  i. 

The  eificiency  of  the  following  recursive  procedure  for  Balance  with  error  logp  relies  upon  the 
fact  (shown  below)  that  there  is  always  some  dimension  with  a  small  associated  discrepancy. 

Algorithm  Balance 

1.  Each  processor  counts  how  many  tokens  it  has  and  stores  the  result.  This  takes  0{m) 
time. 

2.  Let  /  denote  the  dimension  of  the  subcube  being  balanced  (I  =  d  initially).  If  /  =  0, 
return. 

3.  Compute  (Si,  0  <  i  <  /.  This  involves  performing  /  independent  sums  over  the  entire 
subcube.  These  sum  operations  can  be  pipelined  to  run  in  a  total  of  0(/)  time  [HJ86]. 

4.  Determine  the  dimension  i*  with  least  associated  discrepancy  6i*,  This  takes  0{l) 
time. 

5.  Recursively  balance  the  high  and  low  subcubes  with  respect  to  dimension  i*,  using 
Steps  2  to  6. 

6.  Balance  across  dimension  P,  adjusting  the  token  counts  appropriately.  The  running 
time  of  this  step  is  analyzed  below. 

In  order  to  establish  the  correctness  of  the  preceding  algorithm,  it  is  necessary  to  prove 
that  the  output  hypercube  is  balanced  with  error  logp.  This  follows  easily  by  induction 
using  Lemma  4.1.3. 

To  analyze  the  time  complexity,  only  the  cost  of  Step  6  remains  to  be  determined. 
When  this  step  is  executed,  the  (/  —  l)-dimensional  high  and  low  subcubes  with  respect  to 
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dimension  i*  are  each  balanced  with  error  /  —  1.  Hence,  there  are  integers  a  and  b  such 
that  each  processor  in  the  high  sub  cube  contains  a  number  of  tokens  that  is  in  the  range 
[a,a  +  /  —  1],  and  each  processor  in  the  low  subcube  contains  a  number  of  tokens  in  the 
range  [6,6  +  /  —  1].  Letting  6  =  Si*  gives 

2^-^(6-(a  +  /- 1))  <  6, 

and  so  (6  +  /  -  1)  -  a  <  62^“^  +  2/  -  2.  Similarly,  (a  +  /  ~  1)  6  can  be  bounded  by  the  same 

quantity.  Therefore,  the  cost  of  the  balancing  step  is  0(62"^  +  /).  It  remains  to  bound  the 
minimum  discrepancy  6.  In  the  following  sequence  of  lemmas,  let  A  denote  the  sum  of  the 
discrepancies  in  the  given  hypercube  of  dimension  d, 

Lemma  4*1.4  The  value  of  A  is  maximized  by  packing  all  of  the  tokens  into  a  Hamming 
ball  B  of  size  \n/m\. 

Proof:  Given  an  arbitrary  distribution  of  the  tokens,  it  will  be  shown  that  the  corre¬ 

sponding  value  of  A  is  at  most  that  achieved  by  a  particular  Hamming  ball  configuration. 
First,  transform  the  processor  IDs  of  the  given  hypercube  by  toggling  each  bit  that  corre¬ 
sponds  to  a  dimension  for  which  there  are  more  tokens  in  the  low  sub  cube  than  in  the  high 
subcube.  Note  that  the  transformed  hypercube  yields  precisely  the  same  value  of  A  as  the 
given  hypercube.  It  has  the  additional  property  that  for  every  dimension,  the  high  subcube 
contains  at  least  as  many  tokens  as  the  low  subcube.  Let  u;(i)  denote  the  number  of  I’s 
in  the  d-bit  processor  ID  i,  and  let  n{i)  denote  the  number  of  tokens  at  processor  i.  Now 
observe  that  A  may  be  expressed  as 

A  =  ^  u;(i)n(t)  —  (^  —  w{i))n(i) 

0<i<p  0<i<p 

=  2  w{i)n{i)  —  nrf, 

0<i<p 

where  p  =  2^.  Thus,  maximizing  A  is  equivalent  to  maximizing  '“^(0^(0-  This 

new  sum  is  clearly  maximized  by  distributing  tokens  according  to  the  following  algorithm: 
While  there  are  tokens  left  to  distribute,  put  m  tokens  (or  the  number  of  tokens  remaining, 
if  less  than  m)  into  an  empty  processor  with  largest  w{i)  in  the  set  of  empty  processors. 
The  result  follows  since  this  algorithm  fills  a  Hamming  ball  centered  at  processor  2^—1.  0 

Lemma  4.1.5  Let  an  instance  of  Balance  be  given  for  which  the  tokens  are  packed  into 
a  Hamming  ball  of  size  [n/m].  Then  A  =  Furthermore,  if  m  >  2n/p  and 

[n/m]  =  then  A  = 


4.1.  PROBLEM  DEFINITION:  BALANCE 


39 


Proof:  Assume  the  tokens  are  packed  into  a  Hamming  ball  B  of  radius  r.  The  total 

discrepancy  A  is  bounded  by  m  times  the  number  of  edges  between  B  and  Hd  \  B.  The 
number  of  such  edges  is  maximized  when  r  =  [d/2\,  so  A  = 

For  the  sharper  bound,  Lemma  4.1.2  tells  us  that  the  cardinality  of  the  fringe  of  B  is 

The  number  of  edges  between  B  and  Hd\  B  is  d  -  r  = 
0(d)  (r  is  at  most  [d/2j  since  m  >  2n/p)  times  the  size  of  the  fringe  of  B.  Hence, 
A  =  as  claimed,  [] 

This  section  will  only  make  use  of  the  bound  of  Lemma  4.1.5.  The  more 

detailed  bound  involving  k  will  be  needed  in  the  next  section  in  order  to  analyze  a  load 
balancing  algorithm  involving  multiple  token  types.  The  following  theorem  is  an  immediate 
consequence  of  the  preceding  two  lemmas. 

Theorem  4.1.3  For  any  instance  of  Balance,  the  average  discrepancy  across  a  dimension, 
A/d,  is  0(md“^/^p).  [] 

Recall  that  the  cost  of  the  balancing  step  in  algorithm  Balance  was  shown  above  to  be 
0{S2~^  +  /),  where  S  =  Si*  is  the  minimum  discrepancy.  Now  the  minimum  discrepancy 
is  certainly  no  larger  than  the  average  discrepancy,  so  S  must  be  0(m/“^/^2^)  by  Theo¬ 
rem  4.1.3.  Hence,  the  cost  of  the  balancing  step  is  0(771/""^/^  +  /),  and  the  total  running 
time  of  algorithm  Balance  is  +  /)  =  0{m  log^/^p  +  log2  p). 

Of  course,  this  algorithm  performs  balancing  with  error  logp.  This  should  be  good 
enough  for  most  applications,  but  if  a  minimum  error  (0  if  p|n,  1  otherwise)  balancing  is 
desired,  some  post-processing  is  needed.  Note  that  the  post-processing  task  can  be  viewed 
as  an  instance  of  Balance  with  m  =  logp,  which  can  be  solved  in  O(log^p)  time  using  the 
1-port  version  of  Leighton’s  algorithm  described  in  Section  4.1,1. 

Theorem  4.1.4  The  Balance  operation,  with  minimum  error,  runs  in  O(mlog^/^p+log^p) 
time  on  the  hypercube. 

Proof:  First  apply  algorithm  Balance  described  and  analyzed  earlier  in  this  section  to 

balance  the  load  with  error  logp.  This  takes  0(m  log^/^p  -i-log^p)  time.  Compute  the 
minimum  number  of  tokens,  a,  at  any  processor  and  broadcast  this  value  to  all  processors. 
This  takes  O(logp)  time.  Every  processor  then  sets  aside  a  tokens,  and  the  remaining  tokens 
(of  which  there  are  at  most  logp  at  any  single  processor)  are  balanced  using  Leighton’s 
algorithm  in  O(log^p)  time.  Q 
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4.1.4  Load  Balancing  on  the  Shuffle-Exchange 

As  noted  in  Section  4.1.1,  the  l-port  version  of  Leighton’s  algorithm  runs  on  the  shuffle- 
exchange.  Hence,  the  Balance  operation,  with  minimum  error,  runs  in  O(mlogp)  time  on 
the  shuffle-exchange.  Now  consider  the  following  lower  bound. 

Theorem  4.1.5  Assuming  m  >  2n/p,  the  Balance  operation  requires  fi(ml0g(n/m))  time 
on  the  shuffle-exchange. 

Proof:  Using  techniques  due  to  Leighton  [Lei83],  Cypher  has  proven  that  the  p  pro¬ 

cessors  of  a  shuffle-exchange  can  be  partitioned  onto  c  chips  in  such  a  way  that  fewer 
than  p/2  processors  are  assigned  to  any  single  chip,  and  the  number  of  pins  per  chip  is 
O(p/(cl0g(p/c)))  [Cyp89].  The  pin  count  for  a  particular  chip  C  is  determined  by  the  total 
number  of  edges  joining  a  processor  assigned  to  C  to  a  processor  assigned  to  some  other 
chip.  Letting  SEd  denote  the  graph  corresponding  to  the  shuffle-exchange  of  dimension  d, 
Cypher’s  bound  implies  (by  an  averaging  argument)  that  for  every  integer  x,  0  <  a:  <  p/2, 
there  exists  an  integer  g{x),  x  <  g(x)  <  p/2,  such  that 

f{SEd,gix))  =  O{x/\0gx). 

Now  consider  the  lower  bound  for  Balance  obtained  by  packing  the  n  tokens  into  a  set  S 
of  g(\n/rn])  processors  with  O(|'n/m]/l0g|'n/m])  neighbors.  At  least  n/2  =  fi(n)  tokens 
need  to  be  moved  to  processors  outside  of  the  set  5,  and  each  edge  leaving  S  can  carry 
at  most  one  token  per  time  step.  Hence,  Balance  requires  fi(ml0g(n/m))  time  on  the 
shuffle-exchange  if  m  >  2n/p.  [] 

The  upper  and  lower  bounds  are  tight  for  2n/p  <  m  <  where  e  denotes  an 

arbitrarily  small  positive  constant. 


4.2  Problem  Definition:  MultiBalance 

In  this  section,  a  slightly  more  complicated  load  balancing  problem  than  Balance  will  be 
considered.  Let  n  tokens  be  evenly  distributed  over  p  processors,  that  is,  each  processor 
contains  either  [n/p\  or  \n/p]  tokens.  Each  token  has  an  associated  type.  There  are  g  >2 
different  types  of  tokens,  and  nothing  is  known  about  the  distribution  or  proportion  of  the 
tokens  of  any  particular  type.  The  set  of  tokens  of  a  particular  type  will  be  called  a  group, 
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and  it  will  be  assumed  that  the  g  group  types  are  given  by  the  integers  {0,...,^  -  1}. 
The  problem  is  to  execute  g  Balance  (with  error  operations,  one  over  each  group  of 
tokens.  This  operation  will  be  referred  to  as  MultiBalance  (with  error  ^).  The  motivation 
for  considering  MultiBalance  is  that  it  turns  out  to  be  useful  for  sorting,  as  wiU  be  seen  in 
Chapter  7. 

4.2.1  Upper  Bounds  for  the  Hypercube 

An  implementation  of  MultiBalance  that  runs  in  0((n/p)(log^logp)^/^+^log^  p)  time  on  the 
hypercube  will  now  be  presented.  The  following  definitions,  which  build  on  Definitions  4.1.5 
and  4.1.6,  are  required. 

Definition  4.2.1  Given  an  instance  of  MultiBalance,  the  n  tokens  have  been  multi-balanced 
across  dimension  i  if  and  only  if  each  group  j  has  been  balanced  across  dimension  i,  0  < 
j  <  9- 

Definition  4.2.2  Let  6j  denote  the  discrepancy  across  dimension  i  with  respect  to  group 
J?  0  ^  i  <  5^'  Define  the  multi-discrepancy  across  dimension  i,  denoted  as  the  sum 
Y^Q<j<9 

Algorithm  MultiBalance 

1.  Each  processor  partitions  its  set  of  tokens  into  g  subsets,  one  subset  corresponding  to 
each  of  the  g  token  types.  Each  of  the  subsets  is  counted  and  the  results  are  stored. 
This  takes  0(n/p)  time. 

2.  Let  I  denote  the  dimension  of  the  subcube  being  multi-balanced  {I  =  d  initially).  If 
/  =  0,  return. 

3.  Compute  Sf^ 0  <  i  <  L  This  involves  performing  /  independent  sums  over  the  entire 
subcube  for  each  of  the  g  groups.  Each  set  of  I  sum  operations  can  be  pipelined  to 
run  in  0{l)  time  [HJ86],  so  the  total  running  time  is  0(gl). 

4.  Determine  the  dimension  i*  with  least  associated  multi-discrepancy  .  This  takes 
0(1)  time. 

5.  Recursively  multi-balance  the  high  and  low  subcubes  with  respect  to  dimension  i’*', 
using  Steps  2  to  6. 
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6.  Multi-balance  across  dimension  i*,  adjusting  the  token  counts  appropriately.  The 
running  time  of  this  step  is  analyzed  below. 

The  correctness  of  the  preceding  implementation  of  MultiBalance  with  error  logp  follows 
by  induction  using  Lemma  4.1.3.  If  a  minimum  error  multi-balancing  is  desired,  O(log^p) 
post-processing  per  group  can  be  performed  as  described  in  Section  4.1.  The  total  cost  of 
post-processing  is  thus  0{glog^  p)  time. 

In  order  to  complete  the  analysis  of  the  running  time  of  algorithm  MultiBalance,  it  is 
necessary  to  consider  the  cost  of  Step  6.  Let  Aj  denote  the  sum  of  the  discrepancies  with 
respect  to  group  j  in  the  given  hypercube  of  dimension  d,  Ylo<i<d^i-  denote  the 

sum  of  the  multi-discrepancies  =  So<j<5 

Lemma  4.2.1  For  any  instance  of  MultiBalance,  A^  =  where  k  satisfies  g  = 

Proof:  Let  the  number  of  tokens  in  group  j  be  denoted  ar^,  0  <  j  <  and  consider 

the  contribution  of  Aj  to  A^.  Since  =  ^5  there  is  at  most  one  Xj  that  exceeds 

n/2.  Suppose  that  xi  >  n/2  for  some  group  /.  Then  A/  =  0{{nlp)<Pl‘^p)  =  0{d}/^n)  by 
Theorem  4.1,3.  Now  k  <  logp  =  d  (since  g  >  2),  so  <  dk~^l^  and  A/  =  0{dk'^^l^n). 

Hence,  it  may  be  assumed  that  xj  <  n/2,  0  <  j  <  g.  Let  kj  satisfy  xj/n  =  , 

0  ^  J  <  5^-  The  tokens  of  group  j  can  be  packed  into  processors,  and  by  Lemmas  4.1.4 

and  4.1.5,  Aj  =  =  0{dxjkj^^^)  =  0{d^^^ x j  {n/ x j)).  Consider 

the  function  f{x)  =  xlog^/^(n/a:),  where  x  is  a  real  value  in  the  range  [l,n/2].  One  may 
easily  verify  that  f\x)  <  0  in  this  range.  In  other  words,  /(x)  is  a  concave  function. 
Therefore,  the  sum  5Io<j<5 subject  to  the  constraint  Y,o<j<g^j  —  maximized 
when  all  of  the  Xj’s  are  equal,  that  is,  when  xj  =  n/p.  Forcing  the  Xj’s  to  be  integers  can 
only  decrease  this  sum,  so  A^  =  0{d^^^gf{n/g))  =  0{d^^^g{n/g)log^^^  g)  = 
as  required.  [] 

Lemma  4.2.2  For  any  instance  of  MultiBalance,  the  average  multi-discrepancy  /d 
across  a  dimension  is  0(7i(logp/logp)^/^). 

Proof:  Immediate  from  Lemma  4.2.1,  since  g  =  p^!^  and  k  =  logp/logp.  Q 

Theorem  4.2.1  The  MultiBalance  operation  runs  in  0((n/p)(logplogp)^/^-|-5rlog^p)  time 
on  the  hypercube. 
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Proof:  By  a  simple  extension  of  the  argument  given  in  Section  4.1,  the  time  required 

to  perform  the  multi-balancing  step  is  0{Sf^2~‘  4-  gl).  Now  6^  is  certainly  no  more  than 
fd,  and  the  number  of  tokens  in  a  subcube  of  dimension  I  at  depth  d  —  l  of  the  recursion 
is  0{n2^~'^  -t-  g2^{d  -  /)),  where  the  latter  term  bounds  the  cumulative  effect  of  odd  parity 
in  the  balancing  operations.  Thus,  Lemma  4.2.2  implies  that  the  total  time  expended  in 
Step  6  is 

of  2-‘{n2‘-‘^  +  g2‘id-mogg/lY^^+gl 

\l<l<d 

which  reduces  to  0{{n/p){logglogpy^‘^  glog^  p).  This  dominates  the  time  required  by  all 
other  steps  of  the  algorithm,  including  the  post-processing.  [] 

4,2.2  A  Lower  Bound  for  the  Hypercube 

It  is  easy  to  see  that  -  dgp)/p  is  a  lower  bound  on  the  running  time  of  MultiBalance, 

since  can  only  decrease  by  2p  each  time  step  and  it  is  certainly  less  than  dgp  after  the 
MultiBalance  operation  has  been  performed.  Hence,  exhibiting  a  particular  input  instance 
with  a  high  value  of  A^  gives  a  good  lower  bound  on  the  worst  case  running  time  of 
MultiBalance.  Consider  the  input  instance  given  by  the  following  construction. 

Assume  for  convenience  that  ^  =  2'*,  1  <  r  <  cJ,  and  let  g  =  [rf/rj  - 1  or  [d/rj ,  whichever 
is  odd.  Divide  the  first  qr  bits  of  each  processor  ID  into  r  fields  of  q  contiguous  bits.  The 
ith  field  determines  the  zth  bit  of  an  r-bit  condensed  ID  according  to  the  following  rule, 
0  <  i  <  r.  If  the  majority  of  the  q  bits  in  the  ith  field  are  0,  then  the  ith  bit  of  the 
condensed  ID  is  0;  otherwise,  it  is  a  1.  Note  that  since  q  is  odd  there  will  always  be  a  strict 
majority  of  either  O’s  or  I’s.  Furthermore,  symmetry  implies  that  exactly  2^^^  processors 
share  any  particular  condensed  ID.  In  the  following  lemma,  let  U  denote  such  a  subset  of 
2^“’*  processors. 

Lemma  4.2.3  The  number  of  hypercube  edges  from  processors  in  U  to  processors  outside 
of  U  is  0(rg^/^|t/’|). 

Proof:  By  symmetry,  it  is  sufficient  to  prove  that  the  number  of  such  edges  associated 
with  the  first  (say)  field  is  0(g^/^|i7|).  Also,  it  may  be  assumed  without  loss  of  generality 
that  the  first  bit  of  the  condensed  ID  associated  with  t/  is  a  0.  Let  U'  denote  the  subset 
of  the  processors  of  U  with  |’g/2]  0  bits  and  [g/2j  1  bits  in  the  first  field.  Lemma  A. 1.2 
implies  that  \U^\  =  It  should  be  clear  that  only  the  processors  in  contribute 
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to  the  desired  edge  count,  and  each  of  these  contributes  exactly  \q/2].  Thus,  the  number 
of  edges  leaving  U  that  are  associated  with  the  first  field  is  as  required.  [] 

Theorem  4.2.2  The  MultiBalance  operation  requires  fi((n/p)(log£rlogp)^/^  —  5logp)  time 
on  the  hypercube. 

Proof:  Consider  the  input  configuration  obtained  by  filling  each  of  the  processors  having 
condensed  ID  i  with  n/p  tokens  from  the  ith  group,  0  <  i  <  2^  Lemma  4.2.3  implies  that 
each  group  contributes  0((n/p)(logplogp)^/^)  to  A^,  SO  =  0(n(log5logp)^/^).  As 
argued  above,  this  fact  immediately  implies  the  desired  lower  bound  on  the  running  time 
of  the  MultiBalance  operation.  [] 

4.2.3  Average  Case  Analysis 

Given  n  distinct  tokens  and  p  processors,  there  are 

n\ 

[(n/p)!]P 

different  ways  of  assigning  n/p  tokens  to  each  processor,  assuming  that  n  is  an  integer 
multiple  of  p.  This  section  analyzes  the  average  case  running  time  of  MultiBalance  over  all 
of  these  possible  input  configurations  when  there  are  g  distinct  groups  of  tokens.  The  upper 
bound  to  be  derived  will  be  interesting  for  sufficiently  small  values  of  In  the  following 
discussion,  the  phrase  “with  high  probability”  means  with  probability  1  —  0{p~^)  for  an 
arbitrary  positive  constant  c. 

Let  the  ith  group  contain  ni  tokens,  and  consider  the  expected  contribution  of  group  i 
to  the  total  running  time  of  this  version  of  MultiBalance,  Q  <  i  <  g.  Letting  p  =  2^,  there 
are  (^,)  distinct  subcubes  of  dimension  d! .  Focus  attention  on  one  such  subcube  C,  and 
let  the  random  variable  X  denote  the  number  of  tokens  from  group  i  initially  assigned  to 
some  processor  in  C.  Let  Yj  denote  the  random  variable  that  is  1  if  the  jth  token  of  group 
i  contributes  to  X,  and  0  otherwise,  0  <  j  <  n^.  Letting  9  =  the  expected  value 

of  Yj  is  9,  and  the  expected  value  of  X  is  ni9.  The  variance  of  Yj  is  bounded  above  by 
the  variance  of  the  binomial  distribution  with  probability  9,  Thus,  the  variance  of  X  is  at 
most  ni9(l  —  9)<  ni9.  A  standard  Chernoff  bound  implies  that  with  high  probability,  X  is 
within  0{{ni9 log pY^^)  of  its  expected  value.  Since  there  are  only  p  ^  ^o<d'<d  id')  choices 
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for  C,  every  subcube  of  dimension  d!  contains 


group  %  tokens  with  high  probability,  Q  <  d!  <  d.  Thus,  after  balancing  the  group  i  tokens 
over  subcubes  of  dimension  d',  every  processor  contains 


group  i  tokens  with  high  probability.  The  additive  d'  bounds  the  worst  case  error  in  the 
balancing,  as  given  by  Lemma  4.1.3.  Note  that  this  is  a  pessimistic  estimate  to  apply  to 
the  average  case  behavior  of  MultiBalance,  and  could  certainly  be  improved.  Continuing 
the  analysis,  the  preceding  bound  implies  that  the  total  cost  of  the  jth  multi-balancing 
operation  performed  by  algorithm  MultiBalance  is 


O 


^  ^  j  <  dj  with  high  probability.  Summing  over  j  and  interchanging  the  order  of  summa¬ 
tion,  the  cost  of  all  of  the  multi-balancing  operations  is 


with  high  probability  since  the  sum  over  j  is  dominated  by  the  j  =  0  term.  The  remaining 
sum  is  maximized  by  setting  rii  =  n/g,  0  <  i  <  g,  which  leads  to  a  total  multi-balancing 
cost  of 


with  high  probability.  Note  that  the  glog^  p  term  matches  the  cost  of  post-processing  and 
other  computations  performed  by  MultiBalance. 


The  preceding  analysis  can  also  be  applied  to  the  straightforward  implementation  of 
MultiBalance  that  multi-balances  across  each  of  the  dimensions  in  ascending  order.  Further¬ 
more,  this  version  of  MultiBalance  can  be  made  to  run  as  efficiently  on  the  shuffle-exchange 
as  it  does  on  the  hypercube.  For  the  shuffle-exchange  version,  shuffles  are  not  performed  by 
moving  entire  sets  of  njp  tokens,  but  rather  by  moving  appropriate  subsets  of  the  tokens  to 
make  the  composition  of  the  set  of  tokens  at  each  processor  (that  is,  the  number  belonging 
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to  each  group)  the  same  as  it  would  have  been  if  a  true  shuffle  had  been  performed.  The 
total  cost  of  simulating  the  shuffle  operations  in  this  manner  is  easily  seen  to  be  on  the 
same  order  as  that  of  the  multi-balancing  operations.  This  algorithm  will  be  referred  to  as 
the  shuffle-exchange  version  of  algorithm  MultiBalance.  The  following  theorem  summarizes 
the  two  main  results  of  this  section. 

Theorem  4*2.3  The  average  running  time  of  algorithm  MultiBalance,  as  well  as  that  of 
the  shuffle-exchange  version  of  algorithm  MultiBalance,  is 

0  ( J {n/p)g  logp  +  g  log^  p)  . 


4.3  Summary 

This  chapter  has  described  hypercube  and  shuffle-exchange  algorithms  for  performing  two 
load  balancing  operations:  Balance  and  MultiBalance.  For  the  Balance  operation,  lower 
bounds  were  derived  by  considering  the  particular  input  configuration  obtained  by  packing 
the  tokens  into  a  smallest  possible  set  of  processors  with  low  expansion.  For  the  hypercube, 
an  algorithm  was  given  that  matches  the  lower  bound  to  within  a  multiplicative  constant 
if  m  >  max{2n/p,log^/^p}  and  m  =  0{n/p),  The  lower  bound  for  the  shuffle-exchange  is 
higher  because  the  hypercube  has  better  expansion  properties  than  the  shuffle-exchange. 
Tight  upper  and  lower  bounds  were  obtained  for  the  shuffle-exchange  for  m  in  the  range 
2n/p  <  m  <  where  €  denotes  an  arbitrarily  small  positive  constant. 

Upper  and  lower  bounds  were  given  for  the  MultiBalance  operation  on  the  hypercube. 
These  bounds  are  tight  for  (n/p)(log^logp)^/^  =  Q,{glog^p).  A  straightforward  imple¬ 
mentation  of  MultiBalance  on  the  shuffle-exchange  was  also  described.  Finally,  the  average 
case  complexity  of  the  hypercube  and  shuffle-exchange  implementations  of  MultiBalance  was 
considered.  Not  surprisingly,  these  algorithms  behave  much  better  on  average  than  they  do 
in  the  worst  case. 


Chapter  5 


Upper  Bounds  for  Selection 


This  chapter  describes  three  entirely  different  approaches  to  the  problem  of  selection  on 
the  hypercube  and  shuffle-exchange.  The  first  approach  is  based  on  the  0((loglogn)^) 
algorithm  of  Cole  and  Yap  for  the  parallel  comparison  model  [CY85].  The  speed  of  that 
algorithm  is  based  upon  the  fact  that  small  sets  of  keys  can  be  sorted  very  quickly.  More 
formally,  n  keys  can  be  sorted  in  constant  time  with  processors  on  the  parallel  comparison 
model.  There  exists  an  analogous  result  for  the  hypercube  and  shuffle-exchange,  namely, 
that  n  keys  can  be  sorted  in  O(log7i)  time  with  processors. 

The  second  approach  is  based  on  the  straightforward  EREW  PRAM  selection  algorithm 
of  Vishkin  [Vis83].  The  hypercube  and  shuffle-exchange  implementations  of  this  algorithm 
make  use  of  the  Balance  operation  described  in  Section  4.1.  An  optimal  algorithm  is  obtained 
for  the  pipelined  hypercube. 

The  third  approach  is  not  based  on  any  previous  parallel  algorithm.  The  source  of  its 
efficiency  is  a  sequential  tradeoff  between  preprocessing  and  search  time  in  a  partial  order 
due  to  Borodin  et  al.  [BGLY81].  The  lower  bound  proven  in  Chapter  6  establishes  that, 
for  a  sufficiently  large  ratio  of  keys  to  processors,  the  running  time  of  this  algorithm  is 
asymptotically  optimal  on  a  wide  variety  of  networks. 


5.1  Problem  Definition:  Select 

The  Select  operation  is  defined  as  follows.  Given  n  (9(logp)-bit  keys  and  an  integer  k, 
0  <  k  <  n,  determine  the  A;th  largest  key  and  broadcast  it  to  all  processors.  It  will  be 
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assumed  that  the  set  of  n  keys  is  initially  balanced  with  minimum  error,  that  is,  each 
processor  holds  either  [n/p\  or  [n/p]  keys.  It  will  be  assumed  that  n  =  O(p^)  for  some 
constant  c,  so  that  ajray  indices  and  key  counts  can  be  manipulated  with  a  constant  number 
of  CPU  operations.  Finally,  it  will  be  assumed  that  the  n  keys  axe  distinct.  The  latter 
assumption  is  made  without  loss  of  generality,  since  ties  can  always  be  broken  consistently 
by  making  use  of  the  following  convention. 

1.  If  two  keys  originating  at  different  processors  are  equal,  the  one  originating  from  the 
processor  with  the  higher  ID  is  deemed  to  be  larger. 

2.  If  two  keys  originating  at  the  same  processor  are  equal,  the  one  initially  stored  at  the 
higher  memory  location  is  deemed  to  be  larger. 

This  tie-breaking  procedure  appends  [logn]  =  O(logp)  bits  to  each  key,  and  thus  produces 
at  most  a  constant  factor  overhead. 

Three  selection  algorithms  will  now  be  presented:  SortSelect,  BalanceSelect  and  Search- 
Select. 


5.2  An  Algorithm  Based  on  Sorting 

The  first  selection  algorithm  to  be  considered,  SortSelect,  is  based  on  the  existence  of  a 
fast  sorting  algorithm  when  the  number  of  processors  exceeds  the  number  of  keys  by  a 
polynomial  factor,  li  n  >  p  then  the  algorithm  will  simulate  0{n)  processors,  incurring 
a  slowdown  penalty  of  0(n/p).  Hence,  it  suffices  to  provide  a  selection  algorithm  with 
running  time  O(logploglogp)  for  the  case  n  =  p  in  order  to  establish  the  more  general 
bound  of  0((n/p)logploglogp). 

Algorithm  SortSelect  is  an  adaptation  of  the  parallel  selection  algorithm  of  Cole  and 
Yap  [CY85].  That  algorithm  runs  in  O(loglogp)^  time  on  Valiant’s  parallel  comparison 
model  [Val75],  and  relies  upon  the  fact  that  p^/^  keys  can  be  sorted  in  constant  time 
on  p  processors  to  develop  an  0  (log  log  p)  time  approximation  subroutine  that  achieves 
the  following  degree  of  accuracy.  Given  p  processors  and  n  <  p  keys,  it  returns  a  key 
representing  a  ‘dower  approximation”  to  the  desired  rank  j  key  with  rank  ji  satisfying 

n 

^  4(p/n)i/2 


<  31  <  jy 


(5.1) 
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so  the  approximation  improves  as  p/n  increases.  Similarly,  the  approximation  subroutine 
can  be  made  to  return  an  “upper  approximation”  with  rank  satisfying 

n 


j  <  ju  <j  + 


4{plnY/2- 


(5.2) 


The  ratio  p/n  increases  dramatically  with  each  successive  application  of  the  approximation 
subroutine  because  only  ju  —  ji  keys  survive  to  the  next  round.  When  p/n  reaches 
or  greater,  the  entire  set  of  remaining  candidates  can  be  sorted,  and  the  desired  element 
found,  in  constant  time. 


It  will  now  be  shown  that  the  approximation  subroutine  of  Cole  and  Yap,  which  runs  in 
O(loglogp)  time  in  the  parallel  comparison  model,  can  be  implemented  to  run  in  O(logp) 
time  on  the  hypercube  or  shuffle-exchange.  The  algorithm  outlined  below,  NearSelect,  com¬ 
putes  a  lower  approximation  satisfying  Equation  (5.1)  in  O(logp)  time.  The  task  of  finding 
an  upper  approximation  may  be  handled  in  an  analogous  manner.  In  the  following,  the 
p  given  “physical”  processors  are  slowed  down  by  a  factor  of  4096  in  order  to  simulate 
P  =  4096p  “virtual”  processors.  Physical  processor  i  wiU  simulate  the  4096  virtual  proces¬ 
sors  with  IDs  in  the  range  [40962,4096(z  + 1)).  Initially,  each  of  the  n  =  p  keys  is  “live”,  and 
the  key  located  at  physical  processor  i  is  considered  to  reside  at  virtual  processor  4096i. 


Algorithm  NearSelect 

1.  If  n  <  sort  the  n  keys  in  O(logp)  time  using  MergeSort,  return  the  jth.  key,  and 
halt. 

2.  Note  that  P  and  n  are  both  powers  of  2,  and  P  >  4096n.  If  P/n  is  an  even  power 

of  2,  let  r  equal  (P/n)^/^.  Otherwise,  let  r  equal  (P/2n)^/^,  Divide  the  n  keys  into  s 
short  sets  of  size  r^.  Note  that  r  and  s  =  are  both  powers  of  2.  The  Arth  key  of 
the  zth  short  set  is  located  at  virtual  processor  0  <  k  <  Sort  each  of  the 

short  sets  independently  in  (9(logr)  time  using  MergeSort.  The  virtual  processors 
used  to  sort  the  ith  short  set  are  those  with  IDs  in  the  range  [ir^,  (i  -|-  l)r'^). 

3.  Kill  off  all  but  those  n/r  keys  that  have  a  rank  in  their  short  set  that  is  an  integer 
multiple  of  r.  For  short  set  i,  the  r  surviving  keys  axe  located  at  virtual  processors 

+  kr^^  0  <  k  <  r.  Let  n  equal  n/r.  Let  j  equal  [j/rJ.  Go  to  Step  1. 

It  wiU  now  be  proven  that  NearSelect  has  the  required  performance,  that  is,  it  terminates 
in  O(logp)  time  and  the  rank  ji  of  the  key  ultimately  returned  by  Step  1  is  guaranteed  to 
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satisfy  Equation  (5.1).  The  following  series  of  lemmas  is  very  similar  to  that  proven  by  Cole 
and  Yap  [CY85],  though  some  minor  changes  have  been  made  in  order  to  accommodate  the 
hypercube  and  shuffle-exchange  implementations. 


Lemma  5,2*1  NearSelect  terminates  in  O(logp)  time. 


Proof:  Let  the  values  of  n  and  r  in  the  ith.  iteration  of  the  loop  be  2®*  and  i  >  0, 
respectively.  Letting  p  =  2^,  note  that  P  =  Uo  =  and  bo  =  6.  Furthermore, 

ai^i  =  ai  —  bi  and  the  algorithm  terminates  when  ai  <  {d+  12)/2.  It  is  sufficient  to  prove 
that  the  sequence  {bi}  increases  geometrically,  since  this  would  imply  that  the  algorithm 
terminates  within  O(logd)  =  O(loglogp)  iterations.  Now  observe  that 


bi^i 


> 

> 

> 


{d+12)^{ai^bi)^l 

2 


bi  + 


bi 


-  1 
2~ 


for  aU  bi  >  6  and  thus  for  all  i  >  0.  To  analyze  the  total  running  time  of  the  loop,  note  that 
each  iteration  runs  in  O(logr)  time  so  the  cost  of  the  last  iteration  will  dominate.  Hence, 
the  total  running  time  of  the  loop,  and  of  NearSelect  is  O(logp).  0 

Observe  that  algorithm  NearSelect  has  a  recursive  structure.  Let  S  denote  the  set  of 
live  keys  at  the  start  of  some  iteration  of  the  loop  and  let  T  denote  the  remaining  set  of 
\S\/r  live  keys  at  the  end  of  the  iteration.  To  obtain  a  lower  approximation  to  the  jth  key 
in  5,  NearSelect  first  checks  \S\  to  see  whether  or  not  it  is  small  enough  to  be  sorted  in 
O(logp)  time.  If  so,  sorting  is  performed  and  the  jth.  key  is  returned.  If  not,  it  returns  a 
lower  approximation  to  the  [j/rjth  key  in  the  smaller  set  T,  obtained  recursively. 


Lemma  5.2.2  Let  a  particular  key  belonging  to  5  fl  T  have  rank  A;  in  T  and  rank  in  S. 
Then  rk  —  rs  +  s  <  k'  <  rk. 


Proof:  Associate  with  each  key  in  T  the  r—1  immediately  higher  keys  in  its  short  set.  A 
key  with  rank  A:  in  T  can  be  greater  than  at  most  the  k  lowest  keys  in  T  plus  their  associated 
sets,  in  S,  Hence,  A:'  <  rk.  Similarly,  a  key  with  rank  A:  in  T  must  be  greater  than  the  k 
lowest  keys  in  T  and  at  least  A;  — s  of  their  associated  sets,  in  5.  Hence,  A;  +  (r— 1)(A;  — 5)  <  k\ 
0 
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Lemma  5.2.3  Let  the  lower  approximation  to  the  jth  computed  by  NearSelect  have  actual 
rank  ji  in  S.  Then  j  —  8n/r  <  ji  <  j. 


Proof:  This  result  will  be  proven  by  induction  on  the  number  of  iterations  of  the  loop. 
If  there  is  only  one  iteration,  then  the  entire  set  S  is  sorted  and  ji  =  j.  If  there  is  more 
than  one  iteration,  then  the  induction  hypothesis  implies  that  the  key  returned  as  the 
approximation  to  the  key  in  T  has  rank  A;  in  T  satisfying 

U/»’J  -  8n'/r'  <  k  <  [j/r\, 

where  n'  and  r'  are  the  values  of  n  and  r  used  in  the  next  iteration,  that  is,  n'  =  n/r  and 
r'  is  either  (P/n'Y^'^  or  Applying  Lemma  5.2.2  one  finds  that 

r\j/r\  —  rs  +  s  —  8rn' fr'  <  ji  <  r\J/r\. 


The  right  inequality  implies  that  ji  <  j.  From  the  left  inequality  one  may  obtain 

.  •  /  1  8r(n/r) 

J,  >  + 

>  {j  -  rs)  -  rs  +  s  -  2^/^(n/r)(nr/P)^/^ 

>  j-  2rs  -  2^^‘^{n/r){n/P)^^^ 

>  j  —  2n/r  —  2^^^(n/r) 

>  j  -  8n/r, 


as  required.  Q 


Corollary  5. 2.3.1  The  rank  of  the  key  returned  by  NearSelect,  j/,  is  guaranteed  to  satisfy 
Equation  (5.1). 

Proof:  Use  the  preceding  lemma  and  observe  that 

/  8n  n 

^  ^  -  (P/27i)^/2  “  4(p/7i)i/2’ 

for  r  >  (P/27i)^/^  and  P  =  4096p.  [] 

Algorithm  SortSelect  can  now  be  stated.  Note  that  the  algorithm  will  work  properly 
even  if  rz  <  p,  that  is,  if  only  a  subset  of  the  processors  initially  contain  a  key.  Initially, 
each  of  the  n  keys  is  “live”. 


Algorit hm  SortSelect 
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1.  Count  up  the  number  of  live  keys,  n,  and  broadcast  n  to  all  processors.  This  takes 
O(logp)  time. 

2.  Use  a  prefix  sum  followed  by  a  concentration  route  to  move  the  live  keys  to  the  n  lowest 
numbered  processors,  that  is,  to  processors  0  through  n—l.  This  takes  O(logp)  time. 

3.  Let  n'  be  the  least  power  of  2  that  is  greater  than  or  equal  to  n.  Put  a  dummy  +oo  key 
at  each  processor  in  the  range  n  through  n'  —  1.  Let  n  equal  n'.  This  takes  constant 
time. 

4.  If  71  <  \/!P,  sort  the  n  keys  in  O(logp)  time  using  the  MergeSort  algorithm  of  Nassimi 
and  Sahni  [NS82],  return  the  jth  key,  and  halt. 

5.  Compute  a  lower  approximation  to  the  key  of  rank  j  by  calling  NearSelect.  Compute 
an  upper  approximation  in  an  analogous  manner.  Let  these  two  keys  have  actual 
ranks  ji  and  ju^  respectively.  This  takes  O(logp)  time. 

6.  Broadcast  the  upper  and  lower  approximations  to  all  processors.  This  takes  O(logp) 
time. 

7.  KiU  off  those  keys  that  are  less  than  the  lower  approximation  or  greater  than  the 
upper  approximation.  This  takes  constant  time.  Go  to  Step  1. 

Each  iteration  of  the  main  loop  of  algorithm  SortSelect  takes  0{logp)  time.  A  bound 
will  now  be  established  on  the  number  of  iterations. 

Lemma  5 •2. 4  Algorithm  SortSelect  terminates  within  O (log log p)  iterations.  Hence,  Sort- 
Select  runs  in  O (log  p  log  log  p)  time  on  the  hypercube  and  shuffle-exchange. 

Proof:  Let  n  =  2^"^»  on  the  ith  iteration  of  Step  4.  Note  that  ao  =  0  and  ai  =  1. 

Equations  (5.1)  and  (5.2)  imply  that 

d  —  ^ 

< 

< 

Hence,  at-j-i  >  for  i  >  0  and  the  algorithm  will  terminate  within  0{\ogd)  =  O(loglogp) 
iterations.  [] 


-  a,)  -^d-1 

I  3 
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5.3  An  Algorithm  Based  on  Load  Balancing 

Algorithm  SortSelect  handled  the  case  n  >  p  by  simulating  n  processors,  which  incurs  a 
multiplicative  slowdown  penalty  of  0(n/p).  Thus,  algorithm  SortSelect  does  not  achieve 
optimal  speedup  for  any  value  of  the  ratio  n/p.  On  the  other  hand,  certain  parallel  models 
of  computation  admit  optimal  speedup  for  selection  when  n/p  is  sufficiently  large.  For  the 
EREW  PRAM,  Vishkin  has  exhibited  a  straightforward  selection  algorithm  that  achieves 
optimal  speedup  for  n  =  Q(plogploglogp).  This  result  has  been  improved  by  Cole,  who 
obtained  optimal  speedup  for  n  =  fi(plogplog*  p)  [Col86a]. 

Vishkin’s  algorithm  is  based  on  two  ideas.  First,  if  the  set  5  is  partitioned  into  p  groups 
of  size  n/p,  with  a  single  processor  assigned  to  each  group,  then  the  median  of  each  group 
can  be  computed  in  0(n/p)  time  sequentially,  and  the  median  of  the  resulting  set  of  p 
medians  (which  can  be  obtained  by  using  the  fastest  known  selection  algorithm  for  the  case 
n  =  p)  is  guaranteed  to  be  a  constant  fraction  splitter  for  the  set  5.  In  other  words,  it  is 
greater  than  a  constant  fraction  of  the  keys  in  5,  and  also  less  than  a  constant  fraction  of 
the  keys  in  S  (the  fraction  is  i).  Hence,  by  computing  the  exact  rank  in  S  of  this  median 
of  medians,  a  constant  fraction  of  the  set  S  can  be  discarded  from  further  consideration. 
The  second  idea  is  that  the  keys  which  have  not  been  discarded  can  be  partitioned  into  p 
equal-sized  groups  in  0{n/p)  time.  Iterating  this  process  of  elimination  and  redistribution, 
one  finds  that  the  number  of  keys  remaining  decreases  geometrically  and  the  complexity  of 
Vishkin’s  algorithm  is  0(n/p  +  logpl0g(n/p)),  where  Cole’s  parallel  merge  sort  has  been 
used  to  make  the  additive  term  small  [Col86b]. 

The  selection  algorithm  to  be  presented  in  this  section,  BalanceSelect,  represents  an 
efficient  implementation  of  Vishkin’s  algorithm  for  the  hypercube  and  shuffle-exchange  net¬ 
works.  At  any  given  time,  a  key  that  has  yet  to  be  discarded  by  Vishkin’s  algorithm  wiU 
be  referred  to  as  a  live  key.  The  routine  NearSelect  (defined  in  Section  5.2)  will  be  used  to 
compute  a  live  key  that  is  greater  than,  and  also  less  than,  some  constant  fraction  of  all 
of  the  live  keys.  This  allows  a  constant  fraction  of  the  live  keys  to  be  discarded.  However, 
there  is  no  guarantee  that  any  particular  fraction  of  the  live  keys  within  a  particular  pro¬ 
cessor  wiU  be  eliminated.  In  the  second  phase  of  each  stage,  Balance  is  used  to  redistribute 
the  set  of  live  keys  uniformly  over  the  p  processors.  A  detailed  description  and  analysis  of 
algorithm  BalanceSelect  is  given  below.  Initially,  all  keys  are  “live”. 


Algorithm  BalanceSelect 
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1.  Let  Si  denote  the  set  of  live  keys  located  at  processor  z,  and  let  mi  denote  the  median 

of  Si,  Let  S  =  Uo<i<p*S't  and  let  M  =  {mo, . . .  Let  5  =  maxo<f<p  15^1.  Since 

selection  can  be  performed  in  linear  time  sequentially,  each  processor  can  compute  mi 
with  an  0(|5i|)  time  local  computation.  Hence,  all  of  the  medians  can  be  computed 
in  0(s)  time. 

2.  Run  NearSelect  over  the  set  M  with  j  =  Lp/2J  in  order  to  obtain  an  approximation 
to  the  median  of  S.  Let  m  denote  the  key  given  by  this  approximation.  Note  that 
a  constant  fraction  of  the  keys  in  S  must  rank  lower  (higher)  than  m.  This  takes 
(9(logp)  time. 

3.  Compute  the  rank  f  of  m  in  5  and  broadcast  it  to  all  processors.  This  operation 
takes  0{s  +  log  j?)  time. 

4.  If  f  =  j,  return  m  and  halt. 

5.  If  f  <  each  processor  i  removes  from  Si  those  keys  which  are  less  than  m  and  j  is 
set  to  j  —  f  —  1.  If  f  >  those  keys  which  are  greater  than  m  are  removed  and  j  is 
left  unchanged.  This  takes  0(s)  time. 

6.  Execute  Balance  over  the  remaining  set  of  live  keys.  This  takes  O(slogp)  time  on  the 
shuffle-exchange,  0{s\og^^^  p  -h  log^p)  time  on  the  hypercube  and  0{s  +  log^)  time 
on  the  pipelined  hypercube. 

7.  Determine  whether  or  not  any  processor  contains  more  than  a  single  live  key.  This 
takes  O(logp)  time.  If  so,  go  to  Step  1. 

8.  There  are  at  most  p  live  keys  remaining,  with  0  or  1  at  each  processor.  Now  use 
SortSelect  to  complete  the  selection.  This  takes  O(logj9loglogp)  time. 

From  the  above  analysis,  the  running  time  of  each  iteration  of  Steps  1  to  7  is  dom¬ 
inated  by  the  call  to  Balance  in  Step  6.  Since  s  decreases  geometrically  from  an  initial 
value  that  is  0{nfp),,  the  number  of  iterations  required  is  O(l0g(n/p)).  The  total  run¬ 
ning  time  of  BalanceSelect  is  thus  O {(n/p) log p  +  log ploglogp)  for  the  shuffle-exchange, 
0{{n/p)log^^^  p  +  log^  pl0g{n/p))  for  the  hypercube,  and  0{n/p  +  logploglogp)  for  the 
pipelined  hypercube.  Note  that  the  performance  of  BalanceSelect  on  the  pipelined  hyper¬ 
cube  is  optimal  for  n  >  p  log  p  log  log  p. 


5.4.  AN  ALGORITHM  BASED  ON  SEARCH 


The  asymptotic  complexities  of  NearSelect  and  SortSelect  are  low,  but  hide  rather  large 
multiplicative  constants.  A  more  practical  implementation  of  BaianceSelect  would  make  use 
of  BitonicSort  to  perform  these  selection  operations.  For  sufficiently  large  values  of  the  ratio 
n/p,  this  substitution  has  no  effect  on  the  asymptotic  complexity  of  BaianceSelect. 

5.4  An  Algorithm  Based  on  Search 


The  third  and  final  selection  algorithm  to  be  considered,  SearchSelect,  obtains  efficient 
performance  by  eliminating  a  constant  fraction  of  the  keys  at  every  processor  in  each  iter¬ 
ation.  This  is  not  accomplished  by  redistributing  the  keys  as  in  algorithm  BaianceSelect, 
but  instead  by  searching  for  a  more  accurate  approximation  to  the  desired  key.  A  detailed 
description  and  analysis  of  this  algorithm  will  now  be  presented.  Initially,  all  of  the  keys 
are  “live”. 

Algorithm  Select 

1.  Let  Si  denote  the  set  of  live  keys  located  at  processor  i,  and  let  m,-  denote  the  median  of 

Si.  Let  S  =  Uo<i<p5i  and  let  M  =  {mo, . . .  Let  s  =  maxo<t<p  \Si\.  All  of  the 

medians  can  be  obtained  in  0{s)  time  using  a  linear  time  sequential  method.  Having 
found  the  medians,  partition  the  set  Si  into  its  upper  and  lower  half.  Continue  this 
partitioning  process  to  depth  min{log(n/p),loglogp},  that  is,  until  Si  has  either  been 
fully  sorted  or  has  been  split  into  log  p  subsets,  each  with  approximately  s/  log  p  values. 
Build  a  binary  tree  of  partition  elements  to  facilitate  searching.  The  total  cost  of  this 
preprocessing  is  O(smin{log(n/p),loglogp}).  Given  an  arbitrary  value,  its  rank  in 
Si  can  be  determined  in  O(min{log(n/p),loglogp}  +  s/logp)  time  by  locating  the 
correct  subset  and  then  looking  at  every  key  in  that  subset.  This  sequential  tradeoff 
between  preprocessing  time  and  search  time  is  well  understood,  see  [BGLY81]  and 
[KMR88]. 

2.  Find  that  m  £  M  with  rank  in  S  closest  to  j  (if  there  is  a  tie,  break  it  arbitrarily)  and 

broadcast  it  to  all  processors.  The  key  m  can  be  computed  in  time  0{s  +  log^p)  as 
follows.  First,  let  mj  =  m,-  at  each  processor  i.  Now  sort  the  set  M'  =  (mg, . , .  ,mp_|} 
so  that  m\  =  m^,  where  rrik  has  rank  i  in  M.  This  can  be  done  using  bitonic  sort 
in  O(log^p)  time.  Next  perform  a  binary  search  over  the  set  to  determine  m. 
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This  requires  O(logp)  rank  computations  over  5,  each  of  which  may  be  performed  as 
follows. 

(a)  Let  m'  be  that  key  in  M'  whose  rank  in  S  is  currently  required  by  the  binary 
search.  Broadcast  m',  along  with  the  ID  of  the  processor  that  it  is  stored  in,  to 
all  processors.  This  takes  O(logp)  time. 

(b)  At  each  processor  compute  the  rank  Vi  of  m!  in  Si.  As  observed  above, 
the  preprocessing  performed  in  Step  1  allows  this  operation  to  be  completed 
in  0{s/logp  +  min{log(n/p),loglogp})  time. 

(c)  Sum  the  r,*  values  to  obtain  the  rank  of  m'  in  S.  This  takes  O(logp)  time. 

3.  Let  f  be  the  rank  of  m  in  S.  If  f  =  j,  return  m  and  halt. 

4.  At  each  processor  z,  kill  off  those  keys  in  Si  that  cannot  possibly  have  rank  j  in  S'.  Let 
Ti  be  the  rank  of  m  in  Si  as  computed  in  Step  2b.  Assume  that  f  <  j;  the  case  j'  >  j 
is  similar.  All  of  the  keys  in  Si  with  ranks  in  Si  less  than  or  equal  to  Vi  can  certainly 
be  eliminated.  If  m  >  m^,  note  that  this  has  eliminated  at  least  half  of  the  keys  in 
Si.  If  m  <  m^,  then  the  keys  in  Si  with  ranks  greater  than  or  equal  to  that  of  rui  can 
also  be  eliminated,  since  the  rank  of  mi  in  5  must  be  greater  than  j  in  order  to  avoid 
contradicting  the  choice  of  m.  Once  again,  and  hence  in  all  cases,  the  number  of  live 
keys  in  Si  is  reduced  by  at  least  a  factor  of  2.  Given  the  preprocessing  performed  in 
Step  1,  this  step  can  be  performed  in  0{s/logp  +  min{log(n/p),loglogp})  time. 

5.  Set  j  to  j  —  A,  where  A  is  the  total  number  of  keys  eliminated  in  Step  4  because  their 
rank  in  S  had  to  be  less  than  j.  Note  that  A  can  be  computed  and  broadcast  to  all 
processors  in  O(logp)  time.  Go  to  Step  1. 

The  preceding  analysis  implies  that  each  iteration  of  Steps  1  through  5  executes  in 
0(s  min{log(n/p),loglogp}  +  log^p)  time.  Since  IS't]  is  initially  n/p  and  is  cut  at  least  in 
half  every  iteration,  after  l0g(n/p)  iterations  every  Si  will  contain  at  most  one  element, 
and  the  next  m  computed  in  Step  2  will  have  rank  j  in  S.  Since  s  decreases  geometrically 
from  an  initial  value  of  n/p,  the  total  running  time  is  0((n/p)  min{log(n/p), log lojgp}  + 
log^  p\0g(n/p))  =  0({n/p)loglogp  +  log^  pl0g(n/p)). 

Note  that  the  (n/p)  log  log  p  term  in  the  running  time  is  entirely  due  to  the  cost  of  local 
preprocessing,  as  opposed  to  communication.  The  results  of  this  section  are  summarized 
by  the  following  theorem. 
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Algorithm 

Running  Time 

Transition  Region 

Merges  ort 

SortSelect 

SortSelect 

SearchSelect 

SearchSelect 

0(log2p/  log(p/n)) 

0  (logp  log  logp) 
C>((n/p)  log  plog  logp) 
0(log2p  log  logp) 
0((ra/p)  log  logp) 

n  = 

n  =  0(p) 
n  =  0(plogp) 
n  =  0(plog^p) 

Table  5.1:  Best  known  selection  algorithms  for  the  hypercube  and  shuffle-exchange. 

Theorem  5.4.1  The  SearchSelect  algorithm  runs  on  the  hypercube  and  shuffle-exchange 
in  0{{n/p)log\ogp  +  log^pl0g(n/p))  time.  If  the  values  are  given  in  locally  sorted  form, 
then  SearchSelect  runs  in  O(log^pl0g(n/p))  time.  [] 

Note  that  the  complexity  of  algorithm  SearchSelect  can  be  expressed  in  terms  of  the  cost  of 
the  following  primitive  operations:  sort  of  n  =  p  keys,  broadcast  and  sum.  Thus,  it  adapts 
easily  to  a  variety  of  networks.  To  be  precise,  assume  that  a  particular  network  is  capable 
of  sorting  n  =  p  keys  located  one  per  processor  in  time  Ti  and  can  perform  broadcasting 
and  summing  operations  in  time  T2.  Then  the  running  time  of  algorithm  SearchSelect  may 
be  written  as 

O((ra/p)loglogp4-  (Ti  -1-  T2logp)l0g(n/p)),  (5.3) 

where  the  first  term  disappears  if  the  keys  are  given  in  locally  sorted  form. 

Consider  the  performance  of  algorithm  SearchSelect  on  a  number  of  common  network 
families.  For  the  butterfly,  hypercube  and  shuffle-exchange,  Ti  =  O(log^p)  and  T2  = 
0(logp),  so  the  second  term  in  Equation  (5.3)  is  log^pl0g(ra/p),  and  for  n  =  fi(plog^p)  the 
running  time  of  SearchSelect  is  O((n/p)loglogp).  For  the  d-dimensional  mesh  {d  constant), 
Ti  =  T2  =  ©(p^/*^),  so  the  first  term  dominates  for  n  =  fi(p^+^/‘^log^p/loglogp).  Let  T3 
denote  the  time  required  for  a  given  network  to  perform  selection  over  u  =  p  keys  located 
one  at  each  processor.  If  Talogp  <  Ti  (as  in  the  case  of  the  complete  binary  tree,  for 
example)  then  the  second  term  in  Equation  (5.3)  can  be  reduced  to  (r2  -t-  T3)logpl0g(n/p) 
by  implementing  each  of  the  log(n/p)  binary  searches  over  p  medians  with  logp  selection 
operations  rather  than  a  single  sort. 
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Algorithm 

Running  Time 

Transition  Region 

MergeSort 

SortSelect 

BaianceSelect 

BalanceSelect 

0(log^  pl\og{pln)) 
0  (log  p  log  log;?) 
O(logploglogp) 
0(n/p) 

n  = 

n  =  0(p) 

n  =  ©(plogploglogp) 

Table  5.2:  Best  known  selection  algorithms  for  the  pipelined  hypercnbe. 

This  chapter  has  described  and  analyzed  a  number  of  selection  algorithms  for  the  hypercube 
and  shuffle-exchange.  Table  5.1  summarizes  the  running  times  of  the  best  known  selection 
algorithms  over  ascending  ranges  of  the  ratio  n/p.  For  n  <  pi-®(i/loglogp)^  fastest 
known  selection  method  is  Nassimi  and  Sahni’s  MergeSort  algorithm.  As  the  ratio  n/p  is 
increased  beyond  this  point,  SortSelect  becomes  the  best  known  selection  algorithm.  Finally, 
algorithm  SearchSelect  overtakes  SortSelect  in  the  region  n  =  0(plogp).  The  ranges  of  njp 
have  been  further  subdivided  at  n  =  0(p)  and  n  =  0(plog^p)  in  order  to  isolate  the 
dominant  term  in  the  running  time.  When  the  keys  are  initially  locally  sorted,  the  running 
time  of  SearchSelect  is  reduced  to  O(log^pl0g(7i/p)),  which  represents  an  improvement  for 
n  =  (jj{plogp). 

Table  5.2  summarizes  the  running  times  of  the  best  known  selection  algorithms  for  the 
pipelined  hyper  cube. 


Chapter  6 

A  Lower  Bound  for  Selection 


This  chapter  is  concerned  with  deriving  a  lower  bound  on  the  complexity  of  the  selection 
problem  for  a  certain  large  class  of  networks.  Given  a  set  5  of  n  keys  and  an  integer  it, 
0  <  A;  <  n,  the  selection  problem  is  to  determine  a  key  with  rank  k  in  5.  If  all  of  the  keys 
are  distinct,  there  will  be  a  unique  key  with  rank  k.  The  lower  bound  will  apply  to  the 
special  case  of  the  selection  problem  in  which  all  keys  are  distinct  and  the  key  being  sought 
is  the  median  of  that  is,  k  =  [n/2\.  The  only  operations  allowed  on  keys  are  copy  and 
comparison. 

Given  that  optimal  speedup  of  selection  is  attainable  on  the  EREW  PRAM,  for  n/p 
sufficiently  large,  one  is  led  to  ask  whether  a  similar  result  can  be  achieved  under  a  more 
realistic  model  of  computation  such  as  the  network  model  defined  in  Section  1.1.  In  fact, 
networks  exist  for  which  optimal  speedup  of  selection  is  attainable.  The  sorting  result  of 
Leighton  [Lei85]  and  the  token  distribution  result  of  Peleg  and  Upfal  [PU89]  together  imply 
that  Vishkin’s  algorithm  can  be  implemented  to  run  in  0(n/p  +  logploglogp)  time  on  a 
certain  class  of  bounded  degree  expander  networks.  Given  a  network  corresponding  to  the 
graph  G  =  the  expansion  of  any  subset  U  of  the  vertices  (processors)  is  defined  as 

\T{U)\/\U\j  where  T(U)  denotes  the  set  of  vertices  mV\U  that  are  adjacent  (connected  by 
a  communication  channel)  to  at  least  one  vertex  in  U.  The  expansion  of  a  network  is  the 
minimum  over  all  C  V  such  that  1  <  |i7|  <  \V\/2  of  the  expansion  of  U.  An  expander 
network  is  a  network  with  expansion  fl(l). 

The  main  result  of  this  chapter  is  an  il{(n /p)\oglogp  + log p)  lower  bound  for  selection 
on  any  network  that  satisfies  a  particular  low  expansion  property  defined  in  Section  6.3. 
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The  class  of  networks  satisfying  this  property  includes  all  of  the  common  network  families 
such  as  the  tree,  multi- dimensional  mesh,  hypercube,  butterfly  and  shuffle-exchange.  The 
lower  bound  is  proven  in  Sections  6.2  and  6.3.  Note  that  this  lower  bound  disproves  a 
claim  of  Aggarwal  and  Huang  stating  that  optimal  speedup  is  possible  for  selection  on 
the  hypercube  and  shuffle-exchange  [AH88].  When  n/p  is  sufficiently  large  (for  example, 
greater  than  log^p  on  the  hypercube  and  shuffle-exchange),  the  lower  bound  is  tight  to 
within  a  multiplicative  constant.  The  matching  upper  bound  is  provided  by  the  algorithm 
SearchSelect  presented  in  Chapter  5. 


6.1  The  Lower  Bound  Model 

The  lower  bound  for  selection  proven  in  this  paper  applies  under  a  strictly  more  powerful 
model  of  network  computation  than  the  1-port  model  defined  in  Section  1.1.  In  particular, 
the  only  restrictions  enforced  by  this  model  are  the  following: 

1.  Each  processor  can  send  and/or  receive  at  most  one  key  per  time  step. 

2.  The  only  operations  allowed  on  keys  are  copy  and  comparison,  and  each  processor  can 
perform  at  most  one  such  operation  per  time  step. 

Note  that  an  unlimited  amount  of  computation  and  communication  involving  data  other 
than  keys  can  be  performed  in  each  time  step.  Under  this  model,  it  will  be  proven  that 
any  selection  algorithm  running  on  a  network  satisfying  a  particular  low  expansion  property 
requires  Q,{{n/p)log\ogp  +  logp)  time  steps  in  the  worst  case. 

The  model  of  network  computation  defined  above  wiU  be  referred  to  as  the  lower  bound 
model  throughout  the  remainder  of  this  chapter. 


6.2  A  Restricted  Lower  Bound 

This  section  provides  an  n((n/]?)loglog^  —  logp/loglogp)  time  lower  bound  for  selection 
on  a  complete  network  with  restricted  capability.  In  order  to  simplify  the  exposition,  it  will 
be  assumed  that  n  and  p  are  both  powers  of  2,  n  >  p.  Recall  that  S  denotes  the  set  of  n 
keys,  and  that  there  are  initially  n/p  keys  located  at  each  of  the  p  processors.  The  lower 
bound  established  in  this  section  applies  under  the  model  of  computation  defined  below. 
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Definition  6.2.1  The  restricted  lower  bound  model  is  equivalent  to  the  lower  bound  model 
defined  in  Section  6.1,  with  the  following  additional  restriction.  For  every  C  5,  if  J?  is 
initially  assigned  to  a  set  of  processors  X,  then  at  most  |X|  comparisons  per  time  step  can 
involve  keys  belonging  to  R. 

The  motivation  for  the  restricted  lower  bound  model  is  that  networks  with  poor  ex¬ 
pansion  properties  suffer  from  a  similar,  albeit  less  severe,  inability  to  spread  out  a  con¬ 
centrated  set  of  data  in  order  to  apply  more  processors  to  it.  In  particular,  if  the  size  of 
the  neighborhood  of  a  given  set  of  processors  X  is  a:|X|  where  a  =  o(l/loglogp),  then  in 
0{{n/p)  loglogp)  time  not  even  a  constant  fraction  of  the  \X\n/p  keys  initially  located  in  the 
set  X  can  have  been  moved  or  copied  to  processors  outside  of  X.  Of  course,  the  restricted 
lower  bound  model  is,  by  itself,  quite  unrealistic.  Furthermore,  it  is  unclear  whether  or 
not  selection  can  be  performed  in  0({n/p)\og\ogp)  time  on  this  model,  even  assuming  the 
complete  network.  However,  the  preceding  observation  indicates  that  it  may  be  possible  to 
transfer  a  lower  bound  for  the  complete  network  operating  under  the  restricted  lower  bound 
model  to  a  realistic  network  operating  under  the  lower  bound  model. 

The  remainder  of  Section  6.2  is  devoted  to  proving  an  fi((n/p)loglogp  — logp/loglogp) 
lower  bound  on  the  running  time  of  any  algorithm  for  computing  the  median  on  the  com¬ 
plete  network  operating  under  the  restricted  lower  bound  model.  The  proof  makes  use  of 
an  adversary  argument.  At  each  time  step,  the  algorithm  indicates  which  set  of  at  most  p 
comparisons  it  would  like  to  make,  and  the  adversary  resolves  these  comparisons  sequen¬ 
tially.  Of  course,  the  algorithm  must  respect  the  rules  of  the  restricted  lower  bound  model, 
and  the  adversary  must  resolve  comparisons  in  a  manner  that  is  consistent  with  at  least 
one  total  ordering  of  the  keys.  Sometimes  the  adversary  wiU  give  away  the  outcome  of  a 
comparison  that  has  not  been  performed  by  the  algorithm.  This  is  done  in  order  to  simplify 
the  lower  bound  argument.  Note  that  giving  away  such  additional  information  can  only 
help  the  algorithm. 

It  is  useful  to  keep  in  mind  that  the  restricted  lower  bound  model  does  not  limit  the 
amount  of  computation  or  communication  involving  non-key  data  that  can  be  performed 
in  each  time  step.  Hence,  it  may  be  assumed  that  at  all  times,  every  processor  is  aware 
of  all  of  the  comparison  information  that  has  been  gathered  thus  far.  In  other  words, 
one  may  envision  a  global  controller  that  receives  the  outcome  of  every  comparison  query 
made  in  a  given  time  step,  and  then  performs  an  unbounded  amount  of  computation  in 
order  to  determine  the  next  set  of  comparison  queries.  This  is  essentially  Valiant’s  parallel 
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comparison  model  [Val75],  except  for  the  added  restriction  imposed  by  Definition  6.2.1. 

The  description  and  analysis  of  the  adversary  argument  has  been  divided  into  a  number 
of  parts.  Section  6.2.1  describes  the  information  that  the  adversary  gives  away  at  the  out¬ 
set  of  the  computation.  Section  6.2.2  provides  some  useful  definitions.  Section  6.2.3  states, 
without  proof,  the  invariants  that  wiU  be  satisfied  by  the  adversary.  Section  6.2.4  gives 
the  procedure  by  which  the  adversary  resolves  comparison  queries  made  by  the  algorithm. 
Section  6.2.5  completes  the  construction  of  the  adversary  by  describing  the  additional  in¬ 
formation  given  away  at  certain  points  during  the  computation.  Section  6.2.6  proves  that 
the  adversary  resolves  comparison  queries  in  a  consistent  manner.  Sections  6.2.4  to  6.2.6 
assume  that  the  invariants  of  Section  6.2.3  are  satisfied.  Section  6.2.7  proves  that  the  con¬ 
struction  of  the  adversary  actually  does  ensure  that  these  invariants  are  satisfied.  Finally, 
Section  6.2.8  gives  a  precise  statement  of  the  lower  bound  established  by  the  adversary 
argument. 

6.2.1  The  Initial  Setup 

Before  the  computation  begins,  certain  information  is  given  to  the  algorithm  for  free.  Several 
definitions  are  needed  in  order  to  describe  this  information.  Let  a  block  of  processors  be 
defined  as  a  set  of  processors  B  such  that  the  |J9|n/p  keys  initially  located  in  the  set  B  have 
contiguous  ranks  in  the  set  of  all  keys  5.  Let  A  denote  a  positive  integer  to  be  defined  later 
(it  turns  out  that  A  =  0(logp/loglogp)).  It  will  be  convenient  to  define  the  concept  of  a 
block  at  level  2,  0  <  i  <  A,  which  is  a  block  with  the  following  additional  properties. 

1.  All  blocks  at  level  i  contain  the  same  number  of  processors,  5,-,  and  they  are  pairwise 
disjoint.  Furthermore,  sq  =  p. 

2.  A  block  at  level  i  contains  bi  =  pairwise  disjoint  blocks  at  level  i  +  1, 

0  <  2  <  A  -  1. 

3.  The  size  of  the  blocks  at  level  2  +  1  is  such  that  the  union  of  the  blocks  at  level  2  +  1 
within  a  given  block  B  at  level  i  contains  'y\B\  processors,  0  <  2  <  A  —  1,  where  7  is  a 
real  constant  between  0  and  1  (it  turns  out  that  |  is  an  appropriate  choice  for  7,  see 
below  and  the  proof  of  Lemma  6.2.4). 

Thus,  Si^i  =  jSi/bi  and 
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0  <  i  <  A.  Setting  7  to  the  reciprocal  of  some  power  of  2  will  ensure  that  the  S{'s  are 
integer-valued  as  long  as  p  >  c^il  for  some  constant  c.  Taking  logarithms,  this  requirement 
becomes  logp  >  2Tog2  +  0(i),  which  is  asymptotically  satisfied  for  certain  values  of  A  in  the 
range  0 (log  p/ log  log  p).  Note  that  every  block  has  a  unique  level  that  can  be  determined 
from  its  size. 

The  following  information  is  given  away  by  the  adversary  before  the  computation  begins. 
First,  the  input  permutation  of  the  keys  is  such  that  the  set  of  all  p  processors  forms  a  block 
at  level  0.  This  implies  the  existence  of  a  tree  of  blocks  of  depth  A.  The  algorithm  is  given 
both  the  IDs  of  the  processors  that  make  up  each  of  the  blocks  in  this  tree  as  well  as  the 
ordering  of  the  blocks  within  each  level. 


6.2.2  Useful  Definitions 


The  adversary  argument  proceeds  in  stages  consisting  of  a  number  of  consecutive  time  steps. 
The  number  of  stages  is  given  by  the  positive  integer  A  defined  in  the  preceding  section. 
The  ith  stage  begins  at  time  ti  =  maxo<j<,L(n/p)djJ  -  j  and  ends  at  time  where 
di  =  |log(z  +  1),  0  <  2  <  A.  Note  that  =  0  s-nd  ti^i  >  ti^  0  <  2  <  A.  The  following  pair 
of  technical  lemmas  will  be  useful  for  bounding  the  amount  of  work  that  can  be  performed 
in  a  single  stage,  and  up  to  a  particular  stage. 


Lemma  6.2.1  There  are  at  most  2^p(t\-i)  steps  in  the  2th  stage. 
Proof:  Observe  that  whenever  -  U  is  nonzero,  it  satisfies  the  inequality: 

^.+1  -  ti  <  +  2)J  -  t  -  +  1)J  - 

-  Jib'”/”*'"  (' +  rn-) 

<  — —■ 

21n2p(2  +  1) 


D 


Lemma  6.2.2  The  starting  time  of  the  ith  stage,  ti,  is  at  most  ^(n/p)log(i  +  l),  0  <  i  <  A. 

Proof:  Immediate  from  the  definition  of  tj.  [] 

Like  blocks,  processors  and  keys  are  assigned  unique  level  numbers.  The  level  of  a 
processor  is  i  if  and  only  if  the  highest  level  block  that  it  is  contained  in  is  at  level  i.  The 
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level  of  a  key  is  given  by  the  level  of  the  processor  that  it  is  initially  contained  in.  Note 
that  a  processor  or  key  at  level  i  is  contained  in  exactly  one  block  at  level  J,  0  <  j  < 

At  any  given  time  during  the  computation,  some  subset  of  the  n!  possible  total  orderings 
of  the  keys  remain  consistent  with  all  of  the  information  that  the  algorithm  has  learned. 
Consider  an  arbitrary  pair  of  distinct  keys  x  and  y.  If  x  <  y  in  every  one  of  the  possible 
total  orderings,  then  the  outcome  of  the  comparison  between  x  and  y  is  said  to  be  forced. 

Definition  6,2.2  Once  the  outcomes  of  all  n  —  1  comparisons  involving  a  particular  key 
are  forced,  that  key  is  said  to  be  dead.  Keys  that  are  not  dead  are  live. 

Before  the  beginning  of  each  stage,  the  adversary  will  give  certain  information  away 
(described  in  Section  6.2.1  for  stage  0,  and  in  Section  6.2.5  for  subsequent  stages).  After 
that  information  has  been  given  away,  and  before  the  beginning  of  the  ith  stage,  0  <  i  <  A, 
let  Di  and  Li  =  S  \  Di  denote  the  sets  of  dead  and  live  keys,  respectively.  Note  that  the 
ranks  in  S  of  the  keys  in  Di  have  all  been  determined. 

Let  Ui  denote  the  set  of  all  level  i  keys  in  Li.  Let  Vi  =  Li  \  Ui.  It  will  turn  out  that 
all  keys  of  level  less  than  i  are  dead  by  the  beginning  of  stage  i,  so  Vi  denotes  the  set  of  all 
keys  in  Li  with  level  strictly  greater  than  i. 

6.2.3  Invariants 

This  section  states,  without  proof,  certain  useful  invariants  that  will  be  satisfied  by  the 
adversary.  Section  6.2.7  proves  that  the  construction  of  the  adversary  actually  ensures  that 
these  invariants  are  satisfied. 

As  mentioned  earlier,  the  adversary  gives  away  certain  information  at  the  beginning  of 
the  ith.  stage,  0  <  z  <  A.  After  this  information  has  been  given  away,  and  before  the  ith 
stage  begins,  the  construction  of  the  adversary  will  guarantee  that  the  following  invariants 
hold: 

1.  There  are  integers  j  <  [n/2j  and  k  >  [n/2J  such  that  the  set  of  ranks  in  S  of  the 
keys  in  Di  is  exactly  {0, . . . ,  j  ~  1}  U  {fc, . . . ,  n  -  1}.  Thus,  the  set  of  ranks  in  S  of  the 
keys  in  Li  is  {j,...,A:  ~  1}. 

2.  The  set  Li  is  a  subset  of  a  single  block  at  level  i.  Thus,  every  key  in  Li  is  of  level  i  or 
higher. 


6,2,  A  RESTRICTED  LOWER  BOUND 


65 


Ui  K- 


Figure  6.1:  Extracting  the  sets  Ui^i  and  Vi+i  from  Ui  and  Vi, 

3.  The  key  being  sought,  namely  the  median  of  S,  is  also  the  median  of  Li, 

4.  All  pairs  of  keys  in  Ui  are  incomparable.  Furthermore,  1^7^  >  (1  “*  +  1)* 

5.  The  set  Vi  is  exactly  the  set  of  keys  in  a  single  block  at  level  i  +  1. 

6.  Every  key  in  Ui  is  incomparable  to  every  key  in  Vi. 

7.  Two  or  more  keys  in  Li  could  still  be  the  median  of  5. 

In  particular,  note  that  Invariant  7  implies  that  the  algorithm  cannot  yet  have  terminated 
successfully. 

Figure  6.1  indicates  how  the  sets  17, +1  and  Vi+i  are  related  to  Ui  and  K,  0  <  i  <  A  -  1. 
The  set  Vi  is  a  block  at  level  i  +  1,  and  hence  may  be  partitioned  into  sets  X  and  Fj, 
0  <  i  <  where  X  represents  all  of  the  level  i  +  1  keys  in  Vi,  and  Yj  denotes  the  jth 
block  of  level  i  +  2  in  Vi.  The  adversary  wiU  be  constructed  in  such  a  way  that  Ui^i  is  a 
subset  of  X  and  Vi^i  is  equal  to  for  some  A:,  0  <  A:  <  bi^i. 

6,2.4  Resolving  Comparison  Queries 

This  section  gives  the  procedure  by  which  the  adversary  resolves  comparison  queries  made 
by  the  algorithm. 

Consider  a  single  comparison  made  by  the  algorithm,  between  keys  x  and  y  of  levels  j 
and  A;,  respectively.  Without  loss  of  generality,  assume  that  j  <  k.  If  either  x  or  y  is  a  dead 
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key  (this  will  be  referred  to  as  a  type  A  comparison)  then  the  outcome  of  the  comparison  is 
forced,  and  the  adversary  responds  accordingly.  Similarly,  if  y  does  not  belong  to  the  same 
block  at  level  j  x  (type  B  comparison),  the  response  is  forced.  Otherwise,  y  belongs 
to  the  same  block  at  level  j  as  x,  and  the  two  cases  i  <  j  and  i  =  j  will  be  considered 
separately. 

K  i  =  j  (type  C  comparison)  then  x  belongs  to  Ui  and  y  belongs  to  either  Ui  or  Vi,  The 
adversary  alternately  resolves  queries  of  this  type  by  saying  that  x  is  smaller  than  (larger 
than),  not  only  y,  but  the  entire  set  of  remaining  live  keys.  The  key  x  becomes  a  dead  key. 

In  order  to  determine  how  to  resolve  a  comparison  query  in  the  case  i  <  j  (type  D 
comparison),  the  adversary  consults  a  comparison  tree  that  it  has  been  maintaining  for  the 
block  at  level  j  containing  x.  The  adversary  maintains  such  a  comparison  tree  for  every 
block.  A  comparison  tree  of  the  same  sort  was  used  by  Borodin  et  al.  [BGLY81]  to  obtain 
an  easy  (though  not  their  strongest)  sequential  tradeoff  between  preprocessing  time  and 
search  time  in  a  partial  order.  A  comparison  tree  is  a  binary  tree  with  tokens  placed  at 
certain  nodes.  The  comparison  tree  for  a  block  B  at  level  j  contains  (1  —  '^){nlp)sj  tokens 
corresponding  to  the  keys  of  level  j  in  5,  and  bj  tokens  corresponding  to  the  blocks  of  level 
J  +  1  in  B,  When  it  is  important  to  distinguish  between  these  two  types  of  tokens,  they 
will  be  referred  to  as  key  tokens  and  block  tokens,^  respectively.  At  time  0,  the  key  tokens 
are  all  placed  at  the  root  and  the  block  tokens  are  placed,  one  per  node,  on  the  bj  nodes  at 
depth  logij  =  f*log(j  +  1)]. 

Note  that  every  key  corresponds  to  a  key  token  in  the  comparison  tree  of  exactly  one 
block.  Similarly,  every  block  (except  those  at  the  highest  level,  A  —  1)  corresponds  to  a 
block  token  in  the  comparison  tree  of  exactly  one  block,  namely  that  of  its  parent. 

To  resolve  the  comparison  query  between  keys  x  and  y  in  the  case  i  <  j  (type  D 
comparison),  the  adversary  locates  the  key  token  x'  corresponding  to  x  in  the  comparison 
tree  T  associated  with  the  block  at  level  j  containing  x.  The  adversary  also  locates  the 
token  y'  corresponding  to  y  in  the  same  comparison  tree.  If  j  =  /?,  this  will  be  a  key  token; 
otherwise,  it  wiU  be  the  block  token  corresponding  to  the  unique  block  of  level  y  +  1  that 
contains  y.  Having  located  tokens  x'  and  j/'  in  the  comparison  tree  T,  the  adversary  resolves 
the  comparison  between  keys  x  and  y  as  foUows.  Let  x'  and  y^  reside  in  nodes  u  and  v  of 
the  tree  T,  respectively,  and  let  w  be  the  least  common  ancestor  of  u  and  v.  If  w  is  not 
equal  to  either  u  or  u,  then  the  adversary  does  not  move  any  tokens. 

If  w  is  equal  to  either  u  or  u,  the  adversary  must  move  at  least  one  token  downwards 
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in  the  tree.  If  w  is  equal  to  both  u  and  u,  then  x'  is  moved  to  the  left  child  of  and  y'  is 
moved  to  the  right  child  of  w.  If  w  is  equal  to  u  but  not  u,  then  if  v  is  located  in  the  left 
subtree  of  x'  is  moved  to  the  right  child  of  w.  The  remaining  cases  are  treated  similarly. 

The  preceding  algorithm  describes  how  the  adversary  manipulates  the  tokens  in  each 
comparison  tree,  but  does  not  indicate  how  comparison  queries  are  resolved.  To  resolve 
queries,  the  adversary  interprets  each  comparison  tree  as  the  partial  order  given  by  applying 
the  following  rules: 


1.  The  keys  corresponding  to  two  tokens  in  the  same  comparison  tree  are  incomparable 
if  and  only  if  they  lie  on  a  single  downward  path  from  the  root. 

2.  If  two  tokens  in  the  same  comparison  tree  do  not  lie  on  a  single  downward  path  from 
the  root,  then  the  key  corresponding  to  the  token  lying  to  the  “left”  (that  is,  the  token 
residing  in  the  left  subtree  of  the  least  common  ancestor)  is  deemed  to  be  the  smaller 
key. 

Note  that  the  preceding  rules  apply  to  block  tokens  as  well,  where  the  key  corresponding 
to  the  block  token  is  actually  the  entire  set  of  keys  in  the  corresponding  block.  This  is 
appropriate  since  the  set  of  keys  in  any  block  have  contiguous  ranks  in  5,  and  hence  they 
all  compare  in  the  same  way  to  any  key  outside  of  the  block  (that  is,  if  5  is  a  block  and 
keys  X,  y  and  z  are  chosen  such  that  x,y  £  B  and  z  ^  B,  then  x  <  z  if  and  only  if  y  <  z). 

At  the  beginning  of  the  zth  stage,  consider  the  set  of  comparison  trees  corresponding 
to  the  unique  block  at  level  i  given  by  Invariant  2  and  aU  of  the  blocks  that  it  contains. 
At  any  time  during  the  ith  stage,  the  inequalities  between  live  keys  to  which  the  adversary 
has  committed  itself  are  exactly  those  encoded  by  this  set  of  comparison  trees.  Note  that 
the  initial  configuration  of  the  tokens  in  the  comparison  trees  encodes  precisely  the  infor¬ 
mation  given  away  by  the  adversary  at  the  beginning  of  the  computation,  as  described  in 
Section  6.2.1.  Furthermore,  one  may  check  that  the  token  movements  (if  any)  performed 
in  processing  a  particular  comparison  query  are  always  sufficient  to  resolve  the  query.  In 
Section  6.2.6,  it  wiU  be  proven  that  this  method  of  resolving  queries  can  never  lead  to  an 
inconsistent  response  by  the  adversary. 
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6,2.5  Additional  Information 

This  section  completes  the  construction  of  the  adversary  by  describing  the  additional  infor¬ 
mation  given  away  before  the  beginning  of  each  stage. 

After  stage  i  and  before  stage  let  T  denote  the  comparison  tree  associated  with  the 
block  Vi  given  by  Invariant  5.  The  adversary  counts  the  number  of  key  tokens  residing  on 
each  of  the  bi  paths  of  length  [log(z  -f  1)]  descending  from  the  root  of  T.  Letting  q  denote 
the  path  containing  the  most  key  tokens,  the  adversary  kills  off  certain  keys  in  such  a  way 
that  Ui+i  becomes  the  set  of  key  tokens  on  the  path  g,  and  Vi^i  becomes  the  block  of  keys 
at  level  i  +  2  corresponding  to  the  path  q. 

Before  the  beginning  of  stage  i  -h  1,  the  remaining  live  keys  in  X  =  5  \  ii+i  (recall  that 
Li^i  =  Ui^i  U  K*+i)  3je  killed  off  in  such  a  way  that  the  median  of  S  is  also  the  median 
of  Li^i,  Every  key  that  is  killed  oflF  wiU  either  be  said  to  be  smaller  than  every  key  in 
or  it  will  be  said  to  be  larger  than  every  key  in  Xi+i.  An  arbitrary  consistent  order 
is  maintained  among  the  dead  keys.  Each  key  in  X  of  level  greater  than  or  equal  to  z  +  1 
corresponds  to  a  key  or  block  token  in  the  tree  T  that  does  not  reside  on  the  path  q.  If  the 
token  resides  to  the  left  (right)  of  the  path  then  the  corresponding  key  is  forced  to  be 
smaller  (larger)  than  every  key  in  Xt+i,  in  order  to  ensure  consistency.  On  the  other  hand, 
the  live  level  i  keys  of  X  remain  incomparable  to  every  key  in  ii-j-i.  Hence,  in  killing  off 
each  of  these  keys,  the  adversary  has  the  freedom  to  decide  whether  to  make  it  smaller  or 
larger  than  every  key  in  Tj+i.  In  Section  6,2.7,  it  will  be  proven  that  X  contains  sufficiently 
many  live  level  i  keys  to  force  the  median  of  S  to  be  the  median  of  Li^i . 


6.2.6  Consistency  of  the  Adversary 

This  section  proves  that  the  adversary  resolves  comparison  queries  in  a  manner  that  is 
consistent  with  at  least  one  total  ordering  of  the  keys. 

Section  6,2.4  gave  the  adversary’s  procedure  for  resolving  comparison  queries,  partition¬ 
ing  the  queries  into  types  A,  B,  C  and  D.  For  type  A  and  B  comparisons,  consistency  is 
immediate. 

At  any  given  time  during  the  zth  stage,  let  denote  the  set  of  keys  in  Ui  that  are 
known  to  the  algorithm  to  be  less  than  every  key  in  V{,  and  let  denote  the  set  of  keys 
in  Ui  that  are  known  to  be  greater  than  every  key  in  Vi, 
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Now  consider  the  type  C  comparisons.  As  argued  in  Section  6.2.4,  the  live  keys  in  Ui 
remain  incomparable  to  one  another  and  to  Vi  throughout  the  ith  stage.  Therefore,  at  all 
times  during  the  ith  stage,  any  live  key  in  Ui  could  be  the  minimum  (maximum)  key  among 
all  of  the  remaining  live  keys.  Hence,  the  adversary  can  consistently  kill  off  any  key  in 
Ui  and  assign  it  to  either  or  Uf .  A  similar  argument  can  be  used  to  prove  that  the 
comparison  information  given  away  by  the  adversary  at  the  beginning  of  each  stage  (this 
information  was  described  in  Section  6.2,5)  does  not  lead  to  an  inconsistency. 

The  consistency  of  the  procedure  for  resolving  type  D  comparison  queries  follows  imme¬ 
diately  from  the  following  lemma. 


Lemma  6. 2 ,3  Moving  a  token  downward  in  a  comparison  tree  can  never  result  in  an 
inconsistent  response  by  the  adversary. 


Proof:  Assume  the  lemma  is  false,  and  consider  the  first  downward  movement  of  a  token 
that  results  in  an  inconsistent  response  by  the  adversary.  Assume  the  token  was  moved 
downward  in  comparison  tree  T  corresponding  to  block  B  at  level  i.  First  note  that  while 
a  token  movement  in  T  can  affect  the  partial  order  represented  by  T,  it  cannot  affect  the 
partial  order  represented  by  any  other  comparison  tree.  This  “decoupling”  of  the  partial 
orders  represented  by  the  various  comparison  trees  follows  from  the  use  of  block  tokens  to 
represent  all  of  the  keys  in  block  B  with  level  strictly  greater  than  i.  Since  all  of  the  keys 
within  a  block  B^  at  level  i  +  1  are  treated  as  a  single  key  in  tree  T,  no  comparison  that  is 
resolved  within  T  can  provide  any  ordering  information  regarding  keys  within  block  B\  By 
a  similar  argument,  a  token  movement  in  tree  T  gives  no  information  about  keys  at  levels 
strictly  less  than  i. 

Thus,  the  inconsistency  must  arise  within  the  partial  order  represented  by  comparison 
tree  T  alone.  This  is  impossible  since  moving  a  token  downward  in  a  comparison  tree  can 
only  augment  the  paxtial  order  that  it  represents,  and  a  total  order  over  the  tokens  in  T 
that  is  consistent  with  this  partial  order  is  trivial  to  construct  (by  an  inorder  traversal  of 
the  tree,  for  example).  [] 


The  preceding  discussion  establishes  that,  up  to  any  particular  point  in  the  computation, 
the  behavior  of  the  adversary  is  consistent  with  at  least  one  total  ordering  of  the  keys. 


70 


CHAPTERS,  A  LOWER  BOUND  FOR  SELECTION 


6.2.7  Correctness  of  the  Adversary 

Sections  6.2.4  to  6.2.6  assume  that  the  invariants  of  Section  6.2.3  are  satisfied.  This  section 
proves  that  the  construction  of  the  adversary  actually  does  ensure  that  these  invariants  are 
satisfied. 

The  proof  is  by  induction  on  the  number  of  stages.  It  is  easy  to  check  that  all  of  the 
invariants  are  satisfied  at  the  beginning  of  stage  0,  with  Lq  being  5,  Uq  being  the  set  of 
(1  —  7)n  level  0  keys,  and  Vq  :=  S\Uo  being  the  set  of  keys  in  the  lone  block  at  level  1  (note 
that  6o  =  =  1).  The  induction  hypothesis  is  that  the  invariants  hold  at  the  beginning 

of  the  ith  stage,  0  <  i  <  A  —  1.  It  remains  to  be  proven  that  the  invariants  hold  at  the 
beginning  of  stage  i  +  1.  Of  these,  only  Invariants  3  and  4  do  not  follow  immediately  from 
the  construction  of  the  adversary.  The  task  of  establishing  this  remaining  pair  of  invariants 
will  now  be  addressed. 

The  set  of  keys  Ui  is  initially  contained  in  a  set  of  at  most  (1  —  7)5^  processors  (recall 
that  all  the  keys  in  Ui  are  of  level  z),  so  Definition  6.2.1  and  Lemma  6.2.1  imply  that  only 
j^(n/p)(l  —  7)5t/(z  +  1)  comparisons  made  during  the  ith  stage  can  involve  keys  from 
Ui.  Now  each  such  comparison  (even  if  it  involves  two  keys  from  Ui)  kills  off  only  one 
key  in  Ui,,  and  leaves  the  remaining  live  keys  in  Ui  incomparable  to  one  another  and  to 
Vi.  By  Invariant  4  (which  holds  at  the  beginning  of  stage  i  by  the  induction  hypothesis), 
the  preceding  comparison  bound  implies  that  only  a  fraction  of  the  keys  in  Ui  could 
have  been  killed  off  during  the  zth  stage.  Recall  that  the  adversary  kills  off  keys  in  Ui  by 
alternately  assigning  them  to  Up  and  Ul^ .  Hence,  at  the  end  of  stage  z,  the  algorithm  could 
at  best  have  determined  that  j^\Ui\  of  the  keys  in  Ui  belong  to  Up,  and  that  a  similar 
number  belong  to  (7/^.  At  least  ^1  —  \Ui\  of  the  keys  in  Ui  are  still  incomparable  to  one 

another  and  to  Vi.  Using  the  inequality  of  Invariant  4,  and  the  fact  that  \Vi\  =  {n/p)si^i  = 
'y{n/p)si/bi,  one  finds  that  the  ratio  |i7i|/lVi|  is  at  least  (1  ~  7)/7.  Now  consider  the  proof 
of  the  following  lemma. 

Lemma  6.2,4  For  sufficiently  small  choices  of  7,  any  of  the  keys  in  Vi  could  still  be  the 
median  of  Li  at  the  end  of  stage  i. 

Proof:  The  median  of  S  is  also  the  median  of  Li  by  Invariant  3  (which  holds  at  the 

beginning  of  stage  z  by  the  induction  hypothesis).  Hence,  it  is  sufficient  to  prove  that  the 
size  of  the  set  of  keys  known  to  reside  in  Up  (Uf^)  plus  the  size  of  the  set  Vi  is  less  than 
\Li\/2.  Using  the  inequalities  \Up\  <  '^^lUil  and  |i7ii/|Vi|  >  (1  “  7)/7  proven  above,  this 
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sum  can  be  bounded  by  +l\Li\  at  the  end  of  the  tth  stage.  Thus,  the  lemma  holds 

7  lInfZi'  ^  0.218.  As  mentioned  earlier,  it  is  convenient  to  set  7  to  the  reciprocal  of 
a  power  of  2;  is  an  appropriate  choice.  [] 

Lemma  6.2.4  proves  that  the  adversary  can  kill  off  keys  at  the  beginning  of  stage  f  +  1 
in  such  a  way  that  the  median  of  S  is  also  the  median  of  Li,  as  required  in  Section  6.2.5. 
Thus,  Invariant  3  holds  at  the  beginning  of  stage  f  +  1. 

It  remains  to  prove  Invariant  4.  The  construction  of  the  adversary  ensures  that  the  keys 
of  are  incomparable  at  the  beginning  of  stage  i  +  1,  but  the  lower  bound  on  |f7i+i| 
requires  proof.  Let  T  denote  the  comparison  tree  from  which  the  adversary  extracts  the  set 
The  number  of  key  tokens  in  this  tree  is  equal  to  (1  —  7)(ra/p)s,+i,  and  these  tokens 
reside  initially  in  a  set  of  (1  —  7)si+i  processors.  Let  A  denote  the  average  depth  of  the  key 
tokens  in  T.  Observe  that  every  comparison  made  by  the  algorithm  increments  the  depth 
of  at  most  two  tokens.  Hence,  Definition  6.2.1  and  Lemma  6.2.2  imply  that  A  <  |log(f  +  2) 
at  time  t,+i.  Let  P  denote  the  set  of  6,+i  paths  in  T  from  the  root  to  the  initial  position 
of  each  of  the  block  tokens  in  T,  that  is,  all  paths  of  depth  fIog(f  +  2)] .  Let  aj  denote 
the  number  of  key  tokens  at  depth  j  in  T  at  time  f,+i.  By  a  simple  averaging  argument, 
some  path  in  P  must  contain  at  least 

0<j<  [log(«+2)l 

key  tokens.  This  sum  is  minimized  by  moving  the  tokens  downward  in  a  uniform  fashion. 
Hence,  the  bound  on  A  implies  that  some  path  q'm  P  contains  at  least  a  fraction  2“*°s(*+2)  _ 
of  the  key  tokens.  Therefore  >  (1  -  'r){n/p)si+il{i  +  2),  and  Invariant  4  holds. 

Thus,  the  construction  of  the  adversary  ensures  that  Invariants  1  to  7  all  hold  at  the 
beginning  of  stage  2 ’  +  1,  0  <  i  <  A  —  1,  and  the  proof  by  induction  is  complete. 

6.2.8  The  Lower  Bound 

The  following  theorem  summarizes  the  main  result  of  Section  6.2. 

Theorem  6.2.1  Any  selection  algorithm  for  the  complete  network  running  under  the  re¬ 
stricted  lower  bound  model  requires  |^(ra/p)loglogp  —  O(logp/loglogp)  time  steps. 

Proof:  This  bound  follows  from  Invaadant  7  and  the  definition  of  U,  with  f  =  A  —  1  = 
0(logp/loglogp).  □ 
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The  next  section  proves  that  the  argument  used  to  establish  this  lower  bound  can  be 
adapted  to  a  large  class  of  realistic  networks  running  under  the  lower  bound  model.  Note 
that  for  njp  =  0 (log p/ (log log p)^),  Theorem  6.2.1  does  not  provide  any  useful  information; 
alternative  lower  bound  arguments  need  to  be  applied.  If  every  processor  is  required  to 
receive  a  copy  of  the  median,  then  a  trivial  O(log  p)  lower  bound  holds,  even  for  the  complete 
network  running  under  the  lower  bound  model.  If  this  requirement  is  not  made,  the  task  of 
proving  an  i)(logp)  lower  bound  for  such  a  powerful  network  may  not  be  entirely  trivial.  For 
the  complete  network  running  under  the  restricted  lower  bound  model,  it  is  easy  to  prove 
an  fi(logp)  lower  bound  for  computing  the  maximum  (and  hence  for  selection).  However, 
this  result  is  not  particularly  useful  since  the  proof  does  not  carry  over  to  realistic  networks 
running  under  the  lower  bound  model.  In  any  event,  such  considerations  may  be  avoided 
in  the  special  case  of  n(logp)  diameter  networks,  since  a  simple  fooling  argument  implies 
that  at  least  [d/2]  time  steps  are  necessary  for  any  selection  algorithm  running  (under  the 
lower  bound  model)  on  a  network  with  diameter  d.  Note  that  all  bounded  degree  networks, 
and  all  of  the  networks  considered  in  Section  6.3,  have  ft(logp)  diameter. 


6.3  The  Network  Lower  Bound 

The  purpose  of  this  section  is  to  prove  an  il{(n/p)loglogp  +  logp)  time  lower  bound  for 
selection  on  certain  realistic  networks.  The  lower  bound  will  apply  under  the  powerful  lower 
bound  model  defined  in  Section  6.1,  and  will  be  obtained  by  making  suitable  modifications 
to  the  proof  of  Theorem  6.2.1.  Consider  the  following  definition. 

Definition  6,3. 1  Let  J\f{a^/3)  denote  the  class  of  aU  network  families  T  for  which,  given 
any  p  processor  network  in  .F,  it  is  possible  to  construct  all  of  the  blocks  (as  defined  in 
Section  6.2.1)  at  levels  less  than  /?  in  such  a  way  that  every  block  has  expansion  at  most  a, 
where  a  and  (3  may  depend  on  p, 

A  careful  examination  of  the  proof  of  Theorem  6.2.1  reveals  that  there  are  only  two 
points  at  which  the  definition  of  the  restricted  lower  bound  model  is  invoked;  the  rest  of  the 
proof  applies  to  the  unrestricted  lower  bound  model.  The  first  use  of  Definition  6.2.1  is  in 
the  proof  of  Lemma  6.2.4,  and  the  second  use  leads  to  the  existence  of  a  suitable  set  Ui^i 
in  the  induction  step.  In  both  cases,  Definition  6.2.1  provides  an  upper  bound  of  (1  -  '))si 
on  the  number  of  comparisons  per  time  step  involving  the  set  of  level  i  keys  of  a  particular 
block  at  level  i. 
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Theorem  6.3.1  Let  ^  be  a  network  family  belonging  to  the  class  A/'(o(l/loglogp),/3). 
Then  any  selection  algorithm  for  has  a  running  time  of  at  least  |(n/p)log/3  -  (3  under 
the  lower  bound  model. 

Proof:  In  the  following  argument,  let  X  denote  a  generic  block  at  level  i,  0  <  i  <  -  1. 

Let  Y  denote  the  union  of  the  blocks  at  level  i  +  1  in  X,  and  let  Z  =  X  \  F .  Let  Y'  and  Z' 
denote  the  sets  of  keys  initially  residing  in  Y  and  Z,  respectively.  By  the  remarks  preceding 
the  statement  of  the  theorem,  it  is  sufficient  to  prove  that  the  adversary  construction  of 
Section  6.2  can  be  revised  in  such  a  way  that  the  two  applications  of  Definition  6.2.1  can 
be  avoided. 

The  construction  of  the  adversary  will  be  augmented  in  the  following  manner.  Let  T 
denote  the  comparison  tree  associated  with  block  X.  At  any  given  time  step,  each  of  the 
\Z'\  key  tokens  in  T  is  said  to  be  either  bad  or  good.  A  good  key  token  is  one  for  which  no 
copy  of  the  corresponding  key  has  ever  left  the  set  of  processors  Z.  Thus,  all  key  tokens  are 
initially  good,  and  bad  key  tokens  never  become  good  again.  How  many  good  key  tokens 
in  tree  T  can  become  bad  in  a  single  time  step?  There  are  only  two  ways  for  a  good  key 
token  to  become  bad.  One  way  is  for  a  copy  of  the  corresponding  key  to  be  transmitted 
to  a  processor  outside  of  X.  Since  block  X  has  expansion  o(l/loglogp),  the  number  of 
good  key  tokens  that  can  become  bad  in  this  manner  is  o(|X|/loglogp)  per  time  step. 
The  other  way  for  a  key  token  to  become  bad  is  for  a  copy  of  the  corresponding  key  to  be 
transmitted  to  a  processor  in  Y .  Since  Y  is  the  union  of  a  number  of  disjoint  sets  with 
expansion  o(l/loglogp),  the  number  of  these  events  is  o(|F|/loglogp)  per  time  step.  By 
construction,  \Z\  is  a  constant  fraction  of  |X|,  so  the  total  number  of  key  tokens  that  can 
become  bad  in  a  single  time  step  is  o(|Z|/loglogp).  Now  the  lower  bound  argument  only 
runs  for  O((ra/p)loglogp)  time  steps,  during  which  time  the  number  of  bad  tokens  that  can 
be  generated  is  o{\Z\nlp)  =  o(|Z'|).  In  other  words,  the  number  of  good  key  tokens  in  T  is 
(1  —  o(l))|Z'|  at  all  times  in  the  range  of  interest. 

The  final  modification  to  the  adversary  construction  of  Section  6.2  is  as  follows.  In  the 
induction  step,  the  adversary  now  extracts  the  set  17, +1  from  the  set  of  good  key  tokens 
only.  In  a  single  time  step,  at  most  \Z\  comparisons  can  be  made  involving  keys  in  the 
subset  of  Z'  corresponding  to  the  good  key  tokens  in  T,  since  processors  outside  of  Z  do 
not  have  copies  of  any  of  these  keys.  Thus,  the  argument  of  Section  6.2  goes  through  with 
the  inequality  of  Invariant  4  weakened  by  a  factor  of  1  -  o(l),  the  fraction  of  all  key  tokens 
that  are  guaranteed  to  be  good.  Now  consider  the  proof  of  Lemma  6.2.4.  At  the  beginning 
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of  the  2th  stage,  all  of  the  key  tokens  corresponding  to  Ui  are  good,  and  they  reside  in  a 
set  of  |Z|  processors.  Using  an  expansion  argument  as  above,  Lemma  6.2.1  implies  that  at 
most  a  o(l)  fraction  of  the  key  tokens  corresponding  to  Ui  can  become  bad  during  the  ith 
stage.  This  minor  effect  is  of  no  consequence  since  j  has  already  been  set  to  a  value  that  is 
bounded  away  from  its  maximum  acceptable  value.  A  similar  comment  applies  to  the  effect 
of  the  weakened  inequality  in  Invariant  4.  [] 

Corollary  6.3.1 ,1  Let  .F  be  an  D{logp)  diameter  network  family  belonging  to  the  class 
Ar(o(l/loglogp),0(/3)).  Then  any  selection  algorithm  for  T  has  a  running  time  of  at  least 
|(n/p)log/?  +  fi(logp)  under  the  lower  bound  model. 

Proof:  This  bound  follows  immediately  from  Theorem  6.3.1  and  the  additional  observation 
that  any  selection  algorithm  for  a  network  of  diameter  d  has  a  running  time  of  at  least  [of/2j 
under  the  lower  bound  model.  [] 

6.3.1  The  Hypercube 

Throughout  this  section,  the  quantity  e  will  be  used  to  denote  an  arbitrarily  small  positive 
constant.  A  decomposition  of  the  hypercube  wiU  now  be  defined  that  proves  the  hypercube 
network  family  belongs  to  A/*(o(l/loglogp),  0(log^/^"‘^p)).  Given  a  hypercube  with  p  =  2‘^ 
processors,  let  q  =  or  [d^\  —  1,  whichever  is  odd.  The  exponent  ^  is  a  parameter 
between  0  and  1  to  be  determined  later.  Let  r  =  [d/q\^  and  divide  the  first  qr  bits  of  each 
processor  ID  into  r  fields  of  q  contiguous  bits.  The  ith  field  determines  the  ith.  bit  of  an 
r-bit  condensed  id  according  to  the  following  rule,  0  <  i  <  r.  If  the  majority  of  the  q  bits 
in  the  zth  field  are  0,  then  the  zth  bit  of  the  condensed  ID  is  0;  otherwise,  it  is  a  1.  Note 
that  since  q  is  odd  there  will  always  be  a  strict  majority  of  either  O’s  or  I’s.  By  symmetry, 
2^“^  processors  will  belong  to  any  condensed  subcube  of  dimension  r  —  I  obtained  by  fixing 
the  values  of  /  bits  in  the  condensed  ID,  0  <  /  <  r. 

Lemma  6.3.1  The  expansion  of  a  condensed  subcube  of  dimension  I  is 

Proof:  Let  U  denote  the  set  of  processors  belonging  to  a  particular  condensed  subcube  of 
dimension  /.  By  symmetry,  it  will  suffice  to  consider  the  condensed  sub  cube  corresponding 
to  the  condensed  ID  with  first  I  bits  fixed  to  0,  and  with  the  remaining  r  —  I  unspecified. 
Let  V  denote  the  set  of  processors  in  T{U)\U  that  are  adjacent  to  some  processor  in  V 
across  some  dimension  in  field  0.  It  is  sufficient  to  prove  that  |U|/|[7|  =  But 
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this  ratio  is  readily  seen  to  be  precisely  ^5  which  is  0{q  by  Stirling’s 

approximation.  [] 

Since  I  <r^q^  Q{d^)  and  r  =  Lemma  6.3.1  implies  that  the  expansion  of  every 

condensed  subcube  is  If  8  is  chosen  to  be  any  constant  strictly 

between  2/3  and  1,  then  the  expansion  of  every  condensed  subcube  will  be  O(log“’^p)  = 
o(l/loglogp).  Furthermore,  it  should  be  clear  that  the  condensed  subcube  structure  can  be 
used  to  construct  the  tree  of  blocks  required  by  the  lower  bound  argument  of  Section  6.2, 
at  least  to  a  certain  depth,  since  the  size  of  every  block  is  a  power  of  2.  All  that  remains  is 
to  determine  the  maximum  possible  value  of  /?  as  a  function  of  p.  The  relevant  inequality  is 
^  2^*  for  some  constant  c,  which  is  satisfied  for  /?  =  Setting  8  =  2/3  +  c/2 

gives  /3  =  as  claimed  above.  Slightly  finer  calculations  allow  this  e  to  be  replaced  by 

0(1).  Hence,  Corollary  6. 3. 1.1  implies  the  following  result. 

Theorem  6.3.2  Any  selection  algorithm  for  the  hypercube  has  a  running  time  of  at  least 
^^(n/p)loglogp  +  I2(logp)  under  the  lower  bound  model. 

6.3.2  Other  Networks 

The  above  decomposition  also  works  for  the  shuffle-exchange,  since  it  is  easy  to  prove  that 
Lemma  6.3.1  remains  valid.  Hence,  the  lower  bound  of  Theorem  6.3.2  applies  to  the  shuffle- 
exchange.  Similar  comments  apply  for  the  butterfly  network. 

Low  flux  networks  such  as  the  tree  and  multi-dimensional  mesh  can  be  easily  decomposed 
into  a  large  number  of  equal-sized  components  with  very  poor  expansion.  In  such  cases, 
/?  can  be  increased  so  that  the  lower  bound  of  Theorem  6.3.2  applies  with  an  improved 
multiplicative  constant  of 


6.4  Summary 

The  lower  bounds  for  network  selection  discussed  in  this  chapter  significantly  improve  on 
previously  known  results  when  the  number  of  keys  at  each  processor,  n/p,  is  suflSciently 
large  [GK84].  In  proving  lower  bounds,  it  was  assumed  that  n  and  p  are  powers  of  2,  and 
that  every  processor  begins  with  exactly  n/p  keys.  The  proofs  can  easily  be  extended  to 
handle  arbitrary  values  of  n  and  p  (losing  at  most  a  constant  factor),  and  arbitrary  initial 
distributions  of  the  keys.  Theorem  6.3.1  was  proven  for  network  families  T  belonging  to 
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with  a  =  o{l/loglogp);  for  a  =  Q(l/loglogp)  an  obvious  tradeoff  occurs.  It  is 
likely  that  the  multiplicative  constants  appearing  in  the  lower  bounds  could  be  improved. 
Finally,  it  should  be  emphasized  that  this  work  deals  with  the  worst  case  complexity  of 
selection.  Under  an  average  case  analysis,  and  for  sufficiently  high  values  of  the  ratio  n/p, 
optimal  speedup  of  selection  is  attainable  on  essentially  any  network. 


Chapter  7 


Adaptive  Sorting  Algorithms 


This  chapter  deals  with  the  problem  of  sorting  n  keys  initially  distributed  uniformly  over 
a  hypercube  with  p  processors,  n  >  p.  The  well-known  sequential  lower  bound  for  sorting 
implies  an  fi((n  log  n)/p)  bound  on  the  running  time  of  any  parallel  sorting  algorithm.  For 
the  case  n  =  p,  the  best  known  sorting  algorithm  for  the  hypercube  is  Batcher’s  bitonic 
sort,  which  runs  in  O(log^p)  time  [Bat68].  For  n  /  p,  a  number  of  other  algorithms  have 
been  proposed.  The  running  time  and  range  of  applicability  of  each  of  these  algorithms 
is  summarized  in  Table  7.1.  Note  that  BitonicSort  refers  to  the  straightforward  split-and- 
merge  generalization  of  bitonic  sort,  due  to  Baudet  and  Stevenson  [BS78].  Also,  it  should 
be  emphasized  that  attention  has  been  restricted  to  deterministic,  worst  case  complexity 
algorithms  running  on  the  hypercube.  For  examples  of  results  based  on  other  assumptions, 
the  reader  is  referred  to  [RV87],  [VD88]  and  [Wag86]. 

One  may  verify  that  BitonicSort  provides  optimal  speedup  over  sequential  sorting  only  if 
p  =  0(2v'*°8”).  Two  recent  algorithms,  which  will  be  referred  to  as  CubeSort  (Cypher  and 
Sanz,  [CS88])  and  ColumnSort  (Aggarwal  and  Huang,  [AH88]),  have  improved  this  result 
significantly.  Both  of  these  algorithms  are  optimal  if  n  exceeds  p  by  a  polynomial  factor, 
that  is,  if  ra  =  p^+‘  for  any  constant  e  >  0.  ColumnSort  is  based  on  Leighton’s  technique 
for  sorting  n  values  by  performing  a  constant  number  of  smaller  sorts  [Lei85].  Note  that 
Table  7.1  does  not  indicate  the  running  time  of  ColumnSort  when  e  is  allowed  to  vary.  This 
algorithm  is  not  competitive  for  e  —  o(l)  since  the  hidden  constant  in  the  running  time  is 
proportional  to  (1  +  1/e)"  with  a  =  2/(log  3  - 1)  «  3.419,  as  opposed  to  a  =  1  for  CubeSort. 
Of  course,  it  may  be  possible  to  obtain  an  algorithm  based  on  Leighton’s  column  sorting 
technique  that  achieves  a  smaller  value  of  a. 
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Algorithm 

Running  Time 

Range 

BitonicSort  [Bat68][BS78] 
MergeSort  [NS82] 
ColumnSort  [AH88] 
CubeSort  [CS88] 

0((n/p)  log^p) 
O(log^p/log(p/n)) 
0((ralogn)/p) 
0((n/p)log2p/log(n/p)) 

n  =  n(p) 
n  =  0(p) 

n  =  f  >  0 

n  = 

Table  7.1:  Previous  sorting  algorithms  for  the  hypercube  and  shuffle-exchange. 


The  main  result  of  this  chapter  is  a  new  sorting  algorithm  for  the  hypercube,  SmoothSort, 
that  runs  asymptotically  faster  (in  the  worst  case)  than  any  previously  known  algorithm  over 
a  wide  range  of  the  ratio  n/p.  A  simpler  variant  of  this  algorithm,  which  will  be  referred  to 
as  Quicksort,  will  also  be  presented.  The  running  time  of  QuickSort  is  slightly  greater  than 
that  of  SmoothSort.  The  following  example  illustrates  the  nature  of  the  results.  For  n  = 
plog^  p,  the  sequential  lower  bound  implies  a  lower  bound  of  n(log^p),  the  running  time  of 
BitonicSort  is  O(log^p),  the  running  time  of  CubeSort  is  O(log^  p/ log  log  p),  the  running  time 
of  Quicksort  is  (9(log^/^p)  and  the  running  time  of  SmoothSort  is  O(log^/^p(loglogp)^^/^). 
ColumnSort  is  not  competitive  in  this  range,  and  has  a  running  time  of  about  O(log®*^^^p). 

Both  Quicksort  and  SmoothSort  make  use  of  certain  load  balancing  and  selection  al¬ 
gorithms  given  in  Chapters  4  and  5,  and  these  algorithms  do  not  correspond  to  sorting 
circuits.  In  other  words,  they  are  not  based  solely  on  oblivious  routing  and  compare- 
interchange  operations.  Such  algorithms  wiU  be  referred  to  as  adaptive  sorting  algorithms. 
Chapter  8  describes  two  non-adaptive  sorting  algorithms  that  run  on  the  hypercube  and 
shuffle-exchange,  including  a  slower  version  of  SmoothSort. 


7.1  Problem  Definition:  Sort 

The  Sort  operation  is  defined  as  follows.  Given  n  0(logp)-bit  keys  distributed  uniformly 
over  p  processors  (that  is,  each  processors  holds  at  most  [n/p]  keys),  rearrange  the  n  keys 
so  that  every  key  in  processor  i  is  less  than  or  equal  to  every  key  in  processor  j  whenever 
0  <  2  <  i  <  p.  In  addition,  the  n  keys  should  remain  uniformly  distributed  and  the  set  of 
keys  within  any  particular  processor  should  be  sorted.  As  for  the  Select  operation  defined 
in  Chapter  5,  it  will  be  assumed  that  n  =  O(p^)  for  some  constant  c,  and  that  the  n  keys 
are  distinct. 
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7.2  Sorting  on  the  Hypercube:  Quicksort 

The  following  algorithm  is  based  on  the  well-known  quicksorting  paradigm  of  Hoare  [Hoa62]. 
It  makes  use  of  algorithms  Balance  and  SearchSelect  from  Sections  4.1.3  and  5.4,  respectively. 

Algorithm  QuickSort 

1.  If  the  dimension  of  the  hypercube  being  sorted  is  0,  locally  sort  the  0{n/p)  keys  located 
at  each  processor,  and  return.  If  performed,  this  operation  takes  0{{n/p)log[n/p)) 
time. 

2.  Let  S  denote  the  set  of  n  keys.  Call  SearchSelect  to  find  the  value  with  rank  [n/2]  in 
S.  Let  this  value  be  m.  Using  algorithm  SearchSelect,  this  takes  C)((ra/p)loglogp -f 
log^pl0g(n/p))  time. 

3.  Route  all  keys  that  are  strictly  less  than  m  to  the  low  subcube.  Route  all  keys  that  are 
greater  than  or  equal  to  m  to  the  high  subcube.  To  do  this  each  processor  splits  the 
sorted  list  that  it  currently  holds  into  two  sorted  sublists  and  sends  the  appropriate 
sublist  to  its  neighbor  in  the  highest  dimension.  This  takes  0{n/p)  time. 

4.  At  this  point,  each  processor  contains  between  0  and  2n/p  keys.  CaU  Balance  to 
smooth  out  the  load,  that  is,  to  redistribute  the  keys  so  that  each  processor  holds  a 
list  of  length  [n/pj  or  [n/p].  This  takes  Ci((n/p)log^/^  p -f  log^p)  time. 

5.  Sort  the  low  and  high  subcubes  recursively. 

The  correctness  of  the  preceding  QuickSort  algorithm  should  be  obvious.  The  overall 
time  complexity  of  QuickSort  is  readily  seen  to  be 

0((n/p)  log^/^  p  -f-  log^  p  l0g(n/p)). 

7.2.1  Quicksort  on  the  Pipelined  Hypercube 

As  discussed  in  Section  2.4,  the  existence  of  optimal  merging  algorithms  for  the  pipelined 
hypercube  leads  to  an  optimal  bottom-up  sorting  algorithm  for  n  >  plogp.  The  top-down 
quicksorting  paradigm  leads  to  an  alternative  optimal  sorting  algorithm  for  the  pipelined 
hypercube  for  n  >  plogploglogp.  For  a  fast  implementation  of  QuickSort  running  on 
the  pipelined  hypercube,  the  following  pair  of  changes  should  be  made  to  the  algorithm 
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stated  above.  First,  the  call  to  SearchSelect  in  Step  2  should  be  replaced  by  a  call  to  the 
pipelined  hypercube  implementation  of  BalanceSelect(see  Section  5.3).  Second,  Leighton’s 
pipelined  hypercube  algorithm  for  Balance  (see  Section  4.1.1)  should  be  used  in  Step  4.  One 
may  easily  verify  that  the  running  time  of  this  pipelined  hypercube  version  of  QuickSort  is 
0{{n/p)logp  +  log^ploglogp). 

7.3  A  Faster  Hypercube  Algorithm:  SmoothSort 

The  SmoothSort  sorting  algorithm,  which  is  also  designed  to  run  on  the  hypercube,  will  now 
be  described.  It  makes  use  of  the  MultiBalance  operation  presented  in  Section  4.2. 

Algorit hm  S moothSort 

1.  Locally  sort  the  0(n/p)  keys  located  at  each  processor.  This  takes  0{{n/p)log{n/p)) 
time.  If  p  =  1  then  return. 

2.  Determine  the  2^  keys  with  ranks  ,  0  <  i  <  2^,  and  broadcast  them  to  all  proces¬ 
sors.  These  will  be  called  splitter  keys.  An  appropriate  choice  for  the  parameter  /  will 
be  specified  later.  For  now,  it  will  only  be  assumed  that  2^  <  n/p.  These  selections 
can  each  be  performed  in  O(log^  pl0g(n/p))  time  as  described  in  Section  5.4,  since 
the  keys  have  been  sorted  locally.  Each  broadcast  takes  O(logp)  time.  Thus,  the 
total  time  required  for  this  step  is  O(2^1og^pl0g(n/p)).  Each  processor  now  contains 
a  sorted  list  of  n/p  keys  and  a  sorted  list  of  2^  splitter  keys. 

3.  The  2^  splitter  keys  naturally  partition  the  n  keys  into  2^  groups.  The  ith  group 

consists  of  those  keys  with  ranks  between  and  _  i  inclusive,  0  <  i  <  2^ 

At  each  processor,  label  each  of  the  n/p  local  keys  with  the  appropriate  /-bit  group 
number.  Since  the  list  of  keys  and  the  list  of  splitter  keys  are  sorted,  this  takes 
0(2^  +  n/p)  =  0(n/p)  time. 

4.  Call  MultiBalance  to  smooth  out  each  of  the  2^  groups  of  tokens.  Now  g  =  2^  so  this 
takes  (9((n/p)(/logp)^/^  -|-2^1og^p)  time. 

5.  Loop  over  the  high  order  /  dimensions,  routing  each  group  to  the  appropriate  subcube. 
This  takes  0(/n/p)  time. 
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6.  Call  Balance  to  smooth  out  the  load  in  each  of  the  subcubes.  The  error  in  these 
subcubes  is  at  most  2‘,  so  this  takes  0(2^1og^/^p  +  log^p)  time.  Note  that  if  n/p  is  a 
power  of  2  then  there  is  no  error,  and  this  step  can  be  omitted. 

7.  Sort  the  2^  subcubes  recursively. 


It  remains  to  determine  the  value  of  I  that  minimizes  the  total  running  time  of  Smooth- 
Sort  subject  to  the  constraints  /  >  1,  2^  <  n/p  and  /  <  logp.  From  the  analysis  accompa¬ 
nying  the  description  of  the  algorithm,  it  may  be  seen  that  the  cost  of  the  top  level  of  the 
recursion  (that  is,  excluding  the  recursive  calls)  is  dominated  by  an  expression  of  the  form 


0  (2'log^pl0g(n/p)  +  (ra/p)(/logp)^/2^  , 


for  all  valid  choices  of  1.  The  running  time  of  SmoothSort  is  minimized  (to  within  a  constant 
factor)  by  increasing  I  to  the  point  where  the  cost  of  performing  the  selections  balances 
the  cost  of  the  MultiBalance  operation.  This  leads  to  setting  I  =  l0g{n/{pq)),  where  q  = 
log^/^ploglogp.  Substituting  this  choice  of  /  into  the  above  expression,  one  finds  that  the 
cost  of  a  given  level  of  the  recursion  is  a  function  of  p  and  n/p.  The  value  of  n/p  does  not 
vary  with  the  depth  of  the  recursion,  while  p  is  halved  at  each  level.  H  n  <  pq,  then  I  is 
forced  to  1  and  the  running  time  of  SmoothSort  is  the  same  as  that  of  Quicksort.  If  n  >  pq, 
then  2‘  =  Qin/{pq))  and  the  depth  of  the  recursion  is  0((logp)//).  Furthermore,  the  cost 
of  any  level  is  at  most  that  of  the  top  level,  so  the  total  running  time  of  SmoothSort  is 

0  ^(„/p)logp^^Z^^+l„g3pl„g(„/p)^  . 

Note  that  the  deviation  from  optimality  (that  is,  from  the  time  required  by  the  sequential 
lower  bound)  for  n  >  pq  is  given  by  the  square  root  factor.  Previously,  the  best  known 
algorithm  for  this  range  was  CubeSort,  which  deviates  from  the  lower  bound  by  a  factor  of 
logp/log(n/p). 


The  MultiBalance  algorithm  of  Section  4.2.1  is  inappropriate  for  the  shuffle-exchange 
because  it  does  not  access  the  dimensions  predominantly  in  ascending/ descending  order. 
The  shuffle-exchange  version  of  MultiBalance  described  at  the  end  of  Section  4.2.3  has  a 
worst  case  running  time  of  0((n/p)  logp-1- g log^ p).  This  leads  to  a  worst  case  running  time 
of 


0{{n/p)  log^  p/  l0g(n/p)  +  log^  p  l0g(n/p)) 


for  the  shuffle- exchange  implementation  of  SmoothSort. 
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Algorithm 

Running  Time 

Transition  Region 

MergeSort 

BitonicSort 

hybrid 

SmoothSort 

0(log2p/log(p/n)) 
0{(n/p)log'^p) 
C»((n/p)2/3  log^  pl0gV3(n/p)) 
0{{nfp)  log^/2  p!  l0gi/2(n/(p9)) 

n  =  0(p) 
n  =  0(p) 
n  =  Q{pq) 

Table  7.2:  Running  times  of  sorting  algorithms  for  the  hypercube. 


7.3.1  Average  Case  Analysis 

Theorem  4.2.3  shows  that  the  hypercube  and  shuffle-exchange  implementations  of  Multi- 
Balance  perform  much  better  on  average  than  in  the  worst  case.  When  n/p  exceeds  a 
sufficiently  large  polylogarithmic  factor,  one  may  verify  that  the  non-optimality  of  algo¬ 
rithm  SmoothSort  is  entirely  due  to  the  cost  of  performing  the  MultiBalance  operations.  In 
fact,  the  following  result  holds. 

Theorem  7.3.1  The  average  running  time  of  both  the  hypercube  and  shuffle-exchange 
implementations  of  SmoothSort  is 

0({nfp)  log  p  +  log^  pl0g(n/p)), 

which  is  optimal  for  n  >  plog^  ploglogp. 

The  same  theorem  can  be  proven  for  QuickSort. 


7.4  Summary 

Table  7.2  summarizes  the  running  times  of  the  best  known  deterministic  sorting  algorithms 
for  the  hypercube  over  ascending  ranges  of  the  ratio  n/p.  MergeSort  is  listed  first  because 
it  is  the  best  known  sorting  method  (in  the  sense  of  worst  case  asymptotic  complexity) 
when  n  <C  p.  The  last  column  indicates  that  MergeSort  remains  the  best  known  algorithm 
up  to  n  =  0(p),  at  which  point  BitonicSort  has  the  same  complexity.  The  ‘^hybrid”  entry 
refers  to  an  algorithm  to  be  defined  and  analyzed  in  Section  8.2.4.  For  n  =  fi(p^),  where 
q  =  log^/^ploglogp,  SmoothSort  is  the  best  known  sorting  algorithm  and  its  complexity  is 
given  by  the  last  entry  in  the  table.  Of  course,  when  n  exceeds  p  by  a  polynomial  factor, 
CubeSort  and  ColumnSort  also  exhibit  optimal  complexity.  Two  more  algorithms  with  this 
property  will  be  described  in  Chapter  8. 
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The  running  times  stated  in  Table  7.2  for  the  hybrid  algorithm  and  SmoothSort  do  not 
apply  to  the  shuffle-exchange.  The  running  times  of  the  fastest  known  sorting  algorithms 
for  the  shuffle-exchange  axe  summarized  in  Table  8.1  of  Section  8.3. 

In  contrast  with  such  non-adaptive  sorting  algorithms  as  BitonicSort  and  CubeSort,  the 
average  case  complexity  of  SmoothSort  is  not  equal  to  its  worst  case  complexity.  In  fact, 
the  average  case  complexity  of  SmoothSort  is  optimal  for  n  >  plog^ploglogp.  This  state¬ 
ment  applies  to  the  shuffle-exchange  implementation  of  SmoothSort  as  well.  By  simultane¬ 
ously  guaranteeing  good  worst  case  performance,  SmoothSort  avoids  the  potential  pitfaUs 
of  a  simpler  scheme  such  as  HyperQuickSort  [Wag86].  For  solving  the  related  problem  of 
permutation  routing,  SmoothSort  is  even  more  practical  because  the  cost  of  performing  se¬ 
lections  goes  away.  On  the  shuffle-exchange,  SmoothSort  performs  permutation  routing  in 
0{{n/p)\og^  p/\0g{n/p))  time.  The  constant  hidden  by  the  0-notation  is  small  and,  unlike 
CubeSort,  this  bound  holds  for  all  u  >  p. 


Chapter  8 


Non- Adaptive  Sorting  Algorithms 


This  chapter  deals  with  non-adaptive  sorting  algorithms,  that  is,  algorithms  based  solely 
on  oblivious  routing  and  compare-interchange  operations.^  There  are  several  important 
reasons  for  considering  this  restricted  class  of  algorithms. 

1.  Fast  hardware  can  be  used  to  implement  the  small  number  of  operations  required  by 
non- adaptive  algorithms. 

2.  Non-adaptive  algorithms  tend  to  perform  very  little  local  computation,  and  hence  are 
likely  to  run  quickly  on  computers  for  which  the  cost  of  communication  is  low  relative 
to  the  cost  of  local  computation. 

3.  Because  non-adaptive  algorithms  are  based  on  a  small  number  of  simple  operations, 
they  are  more  likely  to  run  efficiently  on  a  wide  variety  of  parallel  models. 

4.  In  the  case  n  =  p,  non-adaptive  algorithms  correspond  to  sorting  circuits.  It  would  be 
interesting  to  determine  whether  or  not  there  exists  a  o(log^  n)  depth  sorting  circuit 
that  can  be  simulated  in  o(log^  n)  time  by  a  non-adaptive  sorting  algorithm  running 
on  the  hypercube  or  shuffle-exchange. 

With  respect  to  the  last  point,  it  should  be  mentioned  that  Ajtai,  Komlos  and  Szemeredi 
have  developed  an  optimal  O(logn)  depth  sorting  circuit  [AKS83].  Unfortunately,  the  O- 
notation  hides  an  impractically  large  constant  factor.  Furthermore,  no  efficient  simulation 
of  the  AKS  sorting  circuit  has  been  found  for  the  hypercube,  shuffle-exchange  or  any  other 
common  network  family. 

^For  n  >  p,  compare-interchange  is  generalized  to  merge-and-split  type  operations. 
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This  chapter  describes  a  non-adaptive  version  of  SmoothSort  that  runs  on  the  shuffle- 
exchange  and  exhibits  the  same  asymptotic  performance  as  CubeSort  for  n/p  sufficiently 
large.  A  sorting  circuit  based  on  recursive  merging,  called  SquareSort,  is  also  presented. 
SquareSort  performs  a  large  merging  task  by  decomposing  it  into  a  number  of  smaller  ones, 
and  can  be  efficiently  implemented  on  the  hypercube  and  shuffle-exchange.  The  decom¬ 
position  technique  is  similar  to  that  considered  by  Van  Voorhis,  but  obtains  a  more  rapid 
decrease  in  the  size  of  the  subsorts  [Van71].  Finally,  three  hybrid  algorithms  based  on 
tradeoffs  between  SquareSort  and  other  sorting  algorithms  are  defined  and  analyzed. 


8.1  A  Non-Adaptive  Version  of  SmoothSort 

This  section  describes  a  non-adaptive  implementation  of  SmoothSort  that  runs  on  the  shuffle- 
exchange  as  well  as  the  hypercube.  It  is  interesting  to  note  that  this  algorithm  performs 
no  explicit  selections.  The  algorithm  is  described  below  in  terms  of  the  hypercube,  but  can 
easily  be  adapted  to  run  in  the  same  asymptotic  time  bound  on  the  shuffle-exchange.  Let 
d  denote  the  dimension  of  the  hypercube  being  sorted. 

Algorithm  SmoothSort 

1.  Locally  sort  the  0{n/p)  keys  located  at  each  processor.  This  takes  0({n/p)log{n/p)) 
time.  If  d  =  0  then  return. 

2.  For  i  =  0  to  d—  1,  merge  pairs  of  lists  across  dimension  i.  Each  of  the  resulting  merged 
lists  is  of  length  2n/p.  Partition  each  such  list  into  two  sublists  of  length  n/p,  one 
consisting  of  the  even-ranked  keys,  and  the  other  consisting  of  the  odd-ranked  keys. 
Send  the  sorted  lists  of  even-ranked  keys  to  the  low  subcube,  and  the  odd-ranked  keys 
to  the  high  subcube.  This  set  of  merge-unshuffle- split  operations  takes  0{{n/p)d) 
time. 

3.  For  i  =  0  to  d  —  1,  merge  pairs  of  lists  across  dimension  i.  Partition  each  of  the 
resulting  merged  lists  into  two  sublists  of  length  n/p,  one  consisting  of  the  lowest  n/p 
keys,  and  the  other  consisting  of  the  highest  n/p  keys.  Send  the  low  list  to  the  low 
subcube,  and  the  high  list  to  the  high  subcube.  This  set  of  merge- and- split  operations 
takes  0((n/p)d)  time. 

4.  Let  d'  be  as  given  by  Equation  (8.1)  below.  If  d'  >  1,  then  sort  subcubes  of  dimension 
d'  recursively  using  Steps  2  to  6. 
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5.  Let  Li  denote  the  sorted  list  of  length  {n/p)2^'  located  in  the  zth  (low-order)  subcube 

of  dimension  d',  0  <  i  Merge  L2j  with  X2j+i  by  reversing  L2j-^i  and  then 

performing  a  bitonic  merge,  0  <  j  <  This  takes  0{(nfp)d^)  time. 

6.  Merge  L2i+i  with  X2i-i-2?  0  <  j  —  1.  This  can  be  done  as  in  the  previous  step, 

except  that  it  is  necessary  to  perform  a  monotone  route  first  in  order  to  move  each 
pair  of  lists  to  be  merged  into  a  single  subcube  of  dimension  2^^"^^.  The  monotone 
route  that  sends  the  data  at  processor  i  to  processor  i  +  2^'  mod  p,  0  <  i  <  p,  is 
appropriate.  The  inverse  monotone  route  must  be  applied  after  the  merging  has  been 
performed.  This  takes  0{{n/p)d)  time.  Note  that  the  time  bound  depends  on  d,  and 
not  d',  due  to  the  monotone  routes.  A  useful  trick  described  at  the  end  of  this  section 
shows  that  the  monotone  route  operations  can  be  avoided,  reducing  the  complexity 
of  this  step  to  0((n/p)d')  time. 

As  Smooth  Sort  is  based  on  compaxe-in  ter  change  operations,  it  is  sufficient  to  consider  its 
performance  on  inputs  consisting  entirely  of  O’s  and  I’s.  This  fact  is  known  as  the  zero-one 
principle  [Knu73].  Accordingly,  assume  that  the  input  consists  of  A;  O’s  and  n  —  k  I’s  for 
some  arbitrary  integer  fc,  0  <  fc  <  n.  Note  that  the  effect  of  the  zth  merge-unshuffle-split 
operation  of  Step  2  is  to  balance  the  number  of  O’s  (and  I’s)  between  neighboring  processors 
across  dimension  z.  Hence,  Lemma  4.1.3  implies  that  there  exists  a  nonnegative  integer  a 
such  that,  after  Step  2  has  been  completed,  every  processor  contains  a  number  of  O’s  in  the 
range  [a,  a  +  d].  Furthermore,  the  inductive  proof  of  Lemma  4.1.3  can  easily  be  augmented 
to  show  that  if  some  processor  does  contain  d  more  O’s  than  another,  then  processor  0  is 
the  unique  processor  with  a  +  d  O’s,  and  processor  2^  —  1  is  the  unique  processor  with  a  O’s. 

It  is  useful  to  think  of  the  n  keys  as  being  arranged  in  a  {njp)  x  p  array,  where  the 
zth  largest  key  in  processor  j  resides  in  row  z  and  column  i,  0  <  z  <  n/p,  0  <  j  <  p. 
Intuitively,  Step  2  is  attempting  to  arrange  the  keys  in  row-major  order.  On  the  other 
hand,  the  goal  of  Step  3,  and  of  the  sort  as  a  whole,  is  to  arrange  the  keys  in  column-major 
order.  Some  additional  notation  is  needed  in  order  to  measure  the  actual  progress  made 
by  these  steps.  Let  J?o(^i)  —  pi  +  j  denote  the  estimated  rank  of  the  key  in  row  z  and 
column  j  just  after  Step  2,  and  let  Ri{i^j)  =  pj  +  z  denote  the  estimated  rank  of  the  key 
in  the  same  location  just  after  Step  3.  Let  ho  denote  the  maximum  value  of  i2o(*?j)  over 
all  0  keys,  and  let  /q  denote  the  minimum  value  of  iZo(^i)  over  aU  1  keys.  Let  hi  and  li  be 
defined  in  a  similar  manner.  Then  the  discussion  of  the  preceding  paragraph  implies  that 
ho-lo  <  p{a  +  d)  “  [p(a  -f-  1)  +  p  -  1]  =  pd  -  2p  -h  1.  Furthermore,  it  is  straightforward  to 
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prove  that  hi  <  ho  and  li  >  Iq. 


Hence,  hi  —  li  +  1  <  pd—2p+2.  This  bound  implies  that  after  Step  3,  every  key  is  within 
l{pd—2p+2)/{n/p)]  columns  (processors)  of  the  correct  output  column.  The  sort  can  now 
be  completed  by  recursively  sorting  each  of  the  (low-order)  subcubes  of  dimension  d', 
where 


d'  = 


log 


\p{pd-2p  +  2) 


n 


(8.1) 


and  then  merging  even  and  odd  pairs  of  subcubes  of  dimension  d'  as  in  Steps  5  and  6. 
The  order  of  these  two  merging  steps  is  interchangeable.  In  order  to  check  that  they 
actually  complete  the  sort,  it  suffices  to  prove  that  odd-even  transposition  sort  (see  [Knu73]) 
terminates  in  two  steps  when  the  input  is  such  that  every  key  is  at  most  one  move  away 
from  its  final  position.  This  fact  is  easy  to  prove,  and  that  it  is  sufficient  follows  from  the 
split-and- merge  technique  of  Baudet  and  Stevenson  [BS78]. 


It  follows  from  Equation  (8.1)  that  the  depth  of  the  recursion  is  0(logp/log(n/(plogp))). 
Since  the  cost  of  every  level  (excluding  recursive  calls)  of  the  recursion  is  bounded  by  that 
of  the  top  level,  the  total  running  time  of  the  non-adaptive  version  of  SmoothSort  on  the 
hypercube  or  shuffle- exchange  is 

Q  (  (n/p)log^p  \ 

\log{n/{plogp)))  ■ 

This  result  matches  the  asymptotic  performance  of  CubeSort  for  n  >  plog^^^  p,  where  € 
denotes  an  arbitrarily  small  positive  constant.  More  importantly,  the  multiplicative  constant 
hidden  by  the  0-notation  is  very  small,  particularly  for  the  hypercube  implementation.  For 
both  the  hypercube  and  shuffle-exchange,  the  constant  associated  with  CubeSort  is  almost 
an  order  of  magnitude  higher. 


There  are  a  number  of  tricks  that  can  be  used  to  speed  up  the  implementation  of 
SmoothSort  slightly.  In  Step  2,  the  communication  cost  can  be  reduced  by  performing  an 
unshuffle-merge,  that  is,  by  sending  every  second  key  to  the  neighboring  processor.  This 
has  the  effect  of  increasing  the  balancing  error  from  d  to  2d,  but  this  adverse  effect  is 
insignificant  if  n/p  is  large.  Step  3  may  run  faster  if  it  is  implemented  as  a  transpose 
(no  merging)  followed  by  a  local  sort.  Finally,  the  monotone  routes  in  Step  6  can  be 
eliminated  by  mapping  columns  (of  the  array  defined  above)  to  processors  in  a  different 
manner.  Specifically,  the  ith  largest  group  of  n/p  keys  should  be  sorted  to  the  processor 
with  ID  equal  to  the  ith.  binary  Gray  code,  rather  than  to  processor  i?  After  sorting  to 
^See  any  introductory  text  on  switching  theory  for  a  definition  of  binary  Gray  codes.  Gray  codes  are  a 
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this  configuration,  the  keys  can  be  routed  to  the  usual  sorted  configuration  in  {n/p)logp 
steps.  The  details  of  this  Gray  coded  scheme  axe  left  to  the  reader.  It  should  be  mentioned 
that  the  same  trick  can  be  used  to  halve  the  depth  of  the  shuffle-exchange  implementation 
of  the  balanced  sorting  network  of  Dowd  et  al.  [DPSR83]. 


8.2  The  SquareSort  Sorting  Circuit 

This  section  presents  a  sorting  circuit  based  on  recursive  merging  called  SquareSort.  Like 
BitonicSort,  the  depth  of  this  sorting  circuit  is  0(log^  n).  On  the  other  hand,  SquareSort 
leads  to  improved  tradeoffs  for  sorting  on  the  hypercube  and  shuffle-exchange  for  n  >  p. 
As  SquareSort  is  based  on  compare-interchange  operations,  the  zero-one  principle  implies 
that  it  is  sufflcient  to  consider  its  performance  on  inputs  consisting  entirely  of  O’s  and  I’s. 
The  SquareSort  algorithm  relies  on  the  following  merging  technique,  called  SquareMerge. 
Consider  a  rectangular  array  A  of  O’s  and  I’s  with  2“  rows  and  2^  columns,  where  a  and  b 
are  nonnegative  integers,  and  in  which  the  rows  and  columns  have  already  been  sorted  in 
ascending  order.  Note  that  the  boundary  between  the  O’s  and  the  I’s  in  array  A  forms  a 
staircase.  The  elements  of  A  may  either  be  viewed  as  being  organized  in  2®  sorted  lists  of 
length  2^,  or  in  2^  sorted  lists  of  length  2®.  The  goal  is  to  produce  a  single  sorted  list  of 
length  2®“*“^.  The  depth  of  the  SquareMerge  sorting  circuit  that  performs  this  merging  task 
will  be  denoted  M(a,6).  For  convenience,  the  merging  task  itself  will  also  be  referred  to  as 
M{a^b).  Note  that  for  all  nonnegative  integers  a  and  6,  M(a,6)  =  M(6, a)  and  M(a,0)  =  0. 
Furthermore,  the  problem  M(a,  1)  will  be  solved  by  a  bitonic  merge,  so  M(a,  1)  =  a  +  1, 
a  >  1.  The  most  interesting  case  remains  to  be  considered,  namely,  when  a  and  b  are  both 
greater  than  1.  Assume  without  loss  of  generality  that  a  >  6.  In  this  Cctse,  the  construction 
of  the  SquareMerge  circuit  will  satisfy 

M(a,  b)  =  M{  La/2J ,  6)  +  M(  [0/2] ,  b)  +  2M{  \al2'\  +6,1).  (8.2) 

The  following  procedure  for  performing  the  merging  problem  Af(a,6)  will  establish  the 
validity  of  Equation  (8.2).  First,  partition  the  rows  of  array  A  into  2r®/^^  groups,  placing 
row  i  into  group  i  mod  21^®/^^,  0  <  i  <  2®.  All  of  the  groups  can  be  sorted  in  parallel  in 
depth  M([a/2J,6).  The  resulting  21^®/^^  sorted  groups  of  size  must  now  be  merged 

in  depth  M(|'a/2],6)  +  2M([a/2]  +  6,1).  Consider  the  following  lemma. 


commonly  used  construct  for  obtaining  efficient  hypercube  embeddings;  see  [Joh87],  for  example. 
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Lemma  8.2,1  Let  integers  i  and  j  satisfy  0  <  i  <  j  <  Then  group  i  contains  fewer 

I’s  than  group  j.  Furthermore,  group  0  contains  at  most  2^  fewer  I’s  than  group  2^^^/^^  -  1. 

Proof:  This  follows  easily  from  the  existence  of  the  staircase  boundary  between  the  O’s 
and  the  I’s  in  array  A.  Q 

Arrange  the  21^®/^^  sorted  groups  in  an  array  A'  with  rows  and  columns. 

The  ith  row  consists  of  group  i,  arranged  in  ascending  order.  The  preceding  lemma  implies 
that  the  columns  are  also  sorted  in  ascending  order,  so  the  remaining  problem  can  be  solved 
as  an  M(|’a/2],  [a/2j  +6).  However,  there  is  additional  structure  to  the  remaining  problem 
that  permits  it  to  be  solved  more  rapidly.  Namely,  Lemma  8.2.1  implies  that  at  most  2^ 
columns  are  dirty  (a  column  is  dirty  if  it  contains  both  O’s  and  I’s),  and  that  the  dirty 
columns  are  contiguous.  Thus,  the  groups  can  be  merged  as  in  Steps  4  to  7  of  algorithm 
SquareMerge,  stated  below.  The  input  to  SquareMerge  is  a  2“  x  2^  array  A  of  O’s  and  I’s, 
where  the  rows  and  columns  have  already  been  sorted  ascending  and  a  >  b.  The  code  for 
a  <  &  is  similar. 

Algorit  hm  S q  u  are M  erge 

1.  Partition  the  rows  of  array  A  into  groups,  placing  row  i  into  group  i  mod  2^“/^! , 

0  <  i  <  2®.  Each  group  forms  a  subarray  with  rows  and  2^  columns. 

2.  Sort  all  of  the  groups  in  parallel.  Each  subproblem  is  an  M([a/2J,6). 

3.  Arrange  the  sorted  groups  in  an  array  A'  with  rows  and  columns. 

The  ith  row  consists  of  group  i,  arranged  in  ascending  order. 

4.  Partition  the  array  A'  into  subarrays  A'-,  where  the  ith  x2^ 

subarray  consists  of  the  columns  i2^  through  (i  +  1)2^  —  1  of  A',  0  <  i  < 

5.  Sort  all  of  the  2^°'!^^  subarrays  in  parallel.  Each  subproblem  is  an  M(l’a/2] ,  6). 

6.  Merge  A^,-  with  A2j_|_i,  0  <  i  <  Each  of  these  subproblems  is  an  M(|"a/2]  + 

6,1). 

7.  Merge  A^j^i  with  A2j_j.2>  0  <  i  <  —  1.  Each  of  these  subproblems  is  an 

M([a/21  +6,1). 
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The  fact  that  the  last  two  merge  operations  actually  complete  the  sort  follows  by  the 
same  argument  was  applied  in  Section  8.1.  Thus,  Equation  (8.2)  holds. 

The  recurrence  of  Equation  (8.2)  (with  base  cases  as  stated  above)  will  now  be  used  to 
obtain  an  upper  bound  on  M{a,b).  Assuming  without  loss  of  generality  that  a  >  6,  an  easy 
induction  proves  that  M{a,b)  <  M{a,a).  Two  applications  of  Equation  (8.2)  then  lead  to 

M{a,a)  <  4Af( [a/21,  +  0{a). 

Making  use  of  the  fact  that  \\xly\lz\  =  \xlyz\  for  all  positive  integers  x,  y  and  2,  the 
recurrence  can  be  unwound  further  to  obtain 

M(a,  a)  <  22*M(  [o/2*l ,  [a/2*l )  +  0(2*a),  (8.3) 

for  all  nonnegative  integers  k.  Setting  k  —  logo,  one  finds  that  M{a,a)  =  O(a^).  Hence, 
M{a,b)  =  0{a'^). 

The  SquareSort  sorting  circuit  can  now  be  defined  in  terms  of  SquareMerge.  In  the 
following  algorithm,  assume  that  SquareSort  is  sorting  2^  keys  for  some  nonnegative  integer 
a,  and  let  S{x)  denote  the  depth  of  the  SquareSort  sorting  circuit  on  2^  keys. 

Algorit  hm  S  q  u a reSort 

1.  If  a  <  3  then  sort  the  keys  using  BitonicSort  and  return. 

2.  Arrange  the  2“  keys  in  a  x  array  A. 

3.  Sort  each  of  the  rows  of  A  recursively  in  parallel.  This  uses  depth  5([a/2j). 

4.  Sort  each  of  the  columns  of  A  recursively  in  parallel.  This  uses  depth  S'(|'a/2]). 

5.  Apply  SquareMerge  to  the  array  A,  This  uses  depth  M([a/2J,  ['a/2]),  which  is  O(a^) 
by  the  preceding  analysis. 

Thus,  an  upper  bound  on  the  depth  of  the  SquareSort  sorting  circuit  is  given  by  the 
solution  to  the  recurrence 


5(a)  <  5(  [a/2j )  +  5(1  a/2l )  +  0{a^),  (8.4) 

with  5(1)  equal  to  a  constant.  Unwinding  this  recurrence,  one  finds  that  5(a)  =  O(a^),  as 
promised.  Of  course,  this  result  is  not  very  interesting  in  view  of  the  fact  that  BitonicSort 
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achieves  the  same  bound  with  a  much  simpler  construction  and  a  smaller  multiplicative 
constant.  The  significance  of  SquareSort  is  that,  like  CubeSort,  it  gives  a  method  for  ex¬ 
pressing  a  single  large  sort  in  terms  of  a  number  of  smaller  ones.  Both  techniques  have 
the  same  (within  a  constant  factor)  efficiency  in  this  regard,  except  that  CubeSort  places  a 
lower  bound  on  the  size  of  the  smaller  sorts.  Thus,  in  certain  cases,  SquareSort  leads  to  a 
tradeoff  while  CubeSort  does  not.  The  main  utility  for  expressing  a  large  sort  in  terms  of 
smaller  ones  is  as  follows.  If  some  other  sorting  method  can  be  used  to  speed  up  the  small 
sorts,  then  SquareSort  can  be  easily  modified  to  reduce  the  running  time  of  the  large  sort 
accordingly. 

Formally,  suppose  that  sets  of  keys  of  size  2^,  where  2*’  <  2“,  can  be  sorted  in  depth  T 
by  some  circuit  X.  Using  Equation  (8.3)  with  A;  =  log  a  -  log  6  -|-  0(1)  gives 

M(ra/2l,[a/2j)  = 


since  T  =  VL{h).  Thus,  using  circuit  X  to  sort  sets  of  size  2*,  the  depth  of  the  SquareSort 
sorting  circuit  satisfies  the  recurrence 

S(«)  <  5(ia/2J)  +  J(ro/21)  +  O  ,  (8.5) 

with  5(1)  equal  to  a  constant.  Three  applications  of  this  technique  are  given  in  Sec¬ 
tions  8.2.2,  8.2.3  and  8.2.4.  First,  however,  it  must  be  shown  that  the  SquareSort  sorting 
circuit  can  be  efficiently  implemented  on  the  hypercube  and  shuffle-exchange,  that  is,  with 
a  running  time  that  is  proportional  to  its  depth.  This  is  the  subject  of  the  following  section. 

8.2.1  Network  Implementations  of  SquareSort 

It  is  relatively  easy  to  obtain  an  efficient  implementation  of  SquareSort  on  the  hypercube. 
Only  the  case  n  —  p  (one  key  per  processor)  needs  to  be  considered  explicitly;  Sections  8.2.2, 
8.2.3  and  8.2.4  give  tradeoffs  with  other  sorting  algorithms  for  n  ^  p.  In  the  case  n  =  p, 
there  are  p  keys  in  the  array  A  defined  by  the  top-level  call  to  algorithm  SquareSort. 

Now  consider  the  effect  of  embedding  the  array  A  in  the  hypercube  in  row-major  order. 
Given  this  embedding,  it  is  straightforward  to  prove  that  every  array  encountered  during  the 
execution  of  SquareSort  satisfies  the  property  that  its  row  and  column  indices  are  encoded 


92 


CHAPTER  S.  NON^ADAPTIVE  SORTING  ALGORITHMS 


by  two  disjoint,  contiguous  sets  of  ID  bits.  Note  that  in  algorithm  SquareMerge,  only  Steps  6 
and  7  involve  any  key  comparisons  and/or  movement  of  data.  The  other  steps  consist  of 
recursive  calls  and  trivial  computations  to  set  up  the  recursive  calls  (to  calculate  which  sets 
of  ID  bits  define  the  row  and  column  indices  of  the  array  to  be  sorted  by  the  recursive  call). 
Step  6  involves  merging  pairs  of  equal-length  sorted  lists.  Each  list  resides  in  a  subcube 
of  dimension  and  each  pair  of  lists  resides  in  a  single  subcube  of  the  next  higher 

dimension.  Thus,  the  merging  operation  can  be  implemented  by  reversing  one  of  the  lists 
and  then  performing  a  bitonic  merge,  all  of  which  can  be  done  in  0{a)  time. 

Step  7  is  slightly  trickier  to  implement.  Once  again,  the  task  is  to  merge  pairs  of  equal- 
length  sorted  lists  where  each  list  resides  in  a  sub  cube.  The  difficulty  is  that  the  pairs  of 
lists  to  be  merged  are  not  necessarily  located  in  adjacent  sub  cubes.  One  solution  to  this 
problem  is  to  perform  a  monotone  route  that  shifts  the  location  of  each  key  in  array  A' 
by  the  length  of  a  sorted  list.  The  pairs  of  lists  can  then  be  merged  as  in  Step  6, 

and  the  merged  lists  routed  back  to  the  appropriate  position  by  a  second  monotone  route. 
The  time  required  to  perform  each  monotone  route  is  proportional  to  the  dimension  of  the 
subcube  containing  A\  Thus,  the  total  time  required  to  perform  Step  7  is  also  0(a). 

The  preceding  discussion  implies  that  the  total  running  time  of  SquareMerge,  excluding 
the  cost  of  recursive  calls,  is  0(a).  Hence,  the  analysis  of  Section  8.2  goes  through  unchanged 
and  the  upper  bound  of  Equation  (8.2)  also  applies  to  the  running  time  of  the  hypercube 
implementation  of  SquareMerge.  Given  subroutine  SquareMerge,  algorithm  SquareSort  is 
straightforward  to  implement  efficiently  on  the  hypercube.  Therefore,  the  circuit  depth 
bounds  of  Equations  (8.4)  and  (8.5)  carry  over  to  running  time  bounds  for  the  hypercube 
implementation  of  SquareSort. 

Figures  8.1  and  8.2  apply  the  SquareSort  sorting  technique  to  a  random  permutation  of 
the  2^  integers  [0,64).  The  example  is  not  entirely  faithful  to  the  SquareSort  program  stated 
in  Section  8.2,  since  a  =  6  and  the  array  A  was  chosen  to  be  4  X  16  rather  than  8x8.  The  bit 
sequences  labelling  the  rows  and  columns  of  each  of  the  4  X  16  arrays  in  Figures  8.1  and  8.2 
indicate  how  the  array  is  mapped  to  the  hypercube  processors.  For  instance,  the  columns 
of  the  first  array  are  labelled  6362^^1  ^tnd  the  rows  are  labelled  6564,  which  means  that  the 
key  in  row  i  =  (riio)2  and  column  j  =  (73727170)2  is  located  at  processor  (riio73727i7o)2- 

The  asymptotic  performance  of  SquareSort  on  the  hypercube  can  be  duplicated  on  the 
shuffie-exchange,  but  not  simply  by  a  naive  translation  of  the  hypercube  implementation. 
There  are  two  problems  that  must  be  avoided  in  order  to  ensure  that  the  bitonic  merge. 
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I  Step  3  of  SquareSort 
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Step  1  of  SquareMerge 
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Figure  8.1:  A  sample  run  of  SquareSort  (continued  in  Figure  8.2) 
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Steps  3  and  4  of  SquareMerge 
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Step  5  of  SquareMerge 
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Step  6  of  SquareMerge 
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Step  7  of  SquareMerge 
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Figure  8.2:  A  sample  run  of  SquareSort  (continued  from  Figure  8.1) 
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list  reversal  and  monotone  route  operations,  which  are  performed  over  subcubes,  can  be 
executed  in  time  proportional  to  the  dimension  of  the  subcube  (as  opposed  to  the  dimension 
of  the  entire  shuffle-exchange,  for  instance).  Recall  that  the  subcubes  of  interest  correspond 
to  the  arrays  defined  by  SquareSort  and  SquareMerge,  and  that  the  row  and  column  indices 
are  given  by  two  disjoint,  contiguous  sets  of  address  bits.  The  first  problem  is  that  the  row 
and  column  address  bits  can  be  far  apart,  forcing  a  pass  over  the  corresponding  dimensions 
to  include  a  long  sequences  of  shuffle  or  unshuffle  operations.  The  second  problem  is  that 
even  if  the  row  and  column  address  bits  form  a  single  contiguous  block,  this  block  may  be 
far  from  the  exchange  (bit  0)  position. 

Both  of  these  problems  may  be  solved  by  permuting  the  data  before  each  recursive 
call  to  SquareMerge  in  order  to  bring  the  row  and  column  bits  together.  Note  that  such  a 
permutation  is  not  always  necessary.  Specifically,  the  array  associated  with  the  recursive 
call  in  Step  5  of  algorithm  SquareMerge  is  already  mapped  to  an  appropriate  subcube,  while 
the  one  associated  with  Step  5  is  not  (for  the  case  a  <  b,  the  situation  is  reversed).  Where 
it  is  needed,  the  appropriate  permutation  can  be  performed  efficiently  using  the  self-routing 
Benes  network  of  Nassimi  and  Sahni  [NS81].  Of  course,  the  inverse  permutation  must  be 
applied  once  the  recursive  call  to  SquareMerge  completes  execution. 


8.2.2  An  Adaptive  Tradeoff  for  n  <p 

Consider  the  problem  of  sorting  n  keys  with  p  processors,  where  n  <  p.  For  this  range, 
the  MergeSort  algorithm  of  Na.ssimi  and  Sahni  runs  in  0{\og^ p/log{p/n))  time.  MergeSort 
reduces  to  a  particularly  simple  algorithm  when  p  >  n^.  The  purpose  of  this  section 
is  to  demonstrate  that  the  same  performance  is  achieved  by  a  hybrid  algorithm  based 
on  SquareSort  and  the  simple  version  of  MergeSort  for  p  >  v?.  The  hybrid  algorithm  is 
SquareSort  except  that  MergeSort  is  applied  to  perform  sorts  that  are  sufficiently  small 
to  allow  a  quadratic  number  of  processors  to  be  applied.  This  is  an  adaptive  tradeoff, 
since  MergeSort  does  not  correspond  to  a  sorting  circuit.  The  running  time  of  the  hybrid 
algorithm  on  a  hypercube  or  shuffle-exchange  of  dimension  a  is  given  by  Equation  (8.5) 
with  b  =  T  =  l0g{p/n).  Solving  this  recurrence,  and  setting  a  =  logp,  leads  to  a  running 
time  of  O(log^p/l0g(p/n))  for  n  <  p,  as  claimed. 
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8.2.3  A  Non-Adaptive  Tradeoff  for  n  >  p 

The  non-adaptive  CubeSort  algorithm  of  Cypher  and  Sanz  runs  in  O((n/p)log^  p/log(n/p)) 
time  on  the  hypercube  or  shuffle-exchange  assuming  that  n  >  plog^^^  p  for  some  constant 
k  [CS88].  The  constant  hidden  by  the  0-notation  is  exponential  in  k  and  is  moderate  even 
for  k  =  1.  The  non-adaptive  version  of  SmoothSort  presented  in  Section  8.1  has  the  same 
time  complexity  for  n  >  plog^''''^p.  However,  SmoothSort  has  a  very  small  multiplicative 
constant  (particularly  on  the  hypercube),  and  appears  to  be  a  truly  practical  algorithm  for 
certain  realistic  values  of  n  and  p. 

The  purpose  of  this  section  is  to  prove  that  a  hybrid  algorithm  based  on  SquareSort 
and  the  non-adaptive  version  of  SnrioothSort  (CubeSort  could  also  be  used  here,  at  the 
expense  of  a  constant  factor)  runs  in  0((n/p)log^p/log(n/p))  time  on  the  hypercube  or 
shuffle-exchange  over  the  entire  range  n  >  p.  The  hybrid  algorithm  is  SquareSort  with  the 
exception  that  SmoothSort  is  applied  to  sorts  over  subcubes  consisting  of  fewer  than  n/p 
processors.  The  analysis  of  the  hybrid  algorithm  is  very  similar  to  that  of  the  previous 
section.  Namely,  the  running  time  of  the  hybrid  algorithm  is  given  by  the  recurrence  of 
Equation  (8.5)  with  a  =  logp  and  b  =  T  =  l0g(Ti/p),  which  leads  to  0{\og^ p/\0g{n/p))  for 
n  >  p.  Note  that  the  hybrid  algorithm  is  non-adaptive. 


8.2.4  An  Adaptive  Tradeoff  for  p  <n  <  pq 

This  section  analyzes  the  performance  of  a  hybrid  sorting  algorithm  for  the  hypercube  based 
on  SquareSort  and  QuickSort.  The  range  of  applicability  of  the  algorithm  is  p  <  n  <  pq, 
where  q  =  log^^^ploglogp.  The  hybrid  algorithm  is  SquareSort,  except  that  QuickSort 
is  applied  to  every  sorting  subproblem  involving  n'  keys  in  a  subcube  of  p'  processors 
where  n'  =  Q{p'\o^^^  pfloglogp').  Using  the  fact  that  n'  =  i'n/p)j/,  this  condition  implies 
log^/^p'  =  0((n/p)/log(Ti/p)).  The  cost  of  running  QuickSort  on  a  problem  of  this  size  is 
O(log^p'loglogp')  =  O((n/p)^/log(n/p)).  Thus,  the  running  time  of  the  hybrid  algorithm 
is  given  by  the  recurrence  of  Equation  (8.5)  with 

a  =  logp, 

^3/2  _  0  iog(n/p)) ,  and 

T  =  O  ((7i/p)Vlog(n/p))  . 
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Algorithm 

Running  Time 

Transition  Region 

MergeSort 

BitonicSort 

hybrid 

O(log^p/log(p/n)) 
0{{n/p)\og^p) 
0{{n/p)log^  p/  l0g{n/p)) 

n  =  Q{p) 
n  =  Q{p) 

Table  8.1:  Running  times  of  sorting  algorithms  for  the  shuffle-exchange. 

Solving  this  recurrence,  one  finds  that  the  running  time  of  the  hybrid  algorithm  is 

O  (log^p(n/p)2/^log^/^(ra/p))  , 

as  claimed  in  Table  7.2. 

Two  points  concerning  this  hybrid  algorithm  should  be  emphasized.  First,  the  algorithm 
does  not  run  (with  the  stated  complexity)  on  the  shuffle-exchange,  since  it  makes  use  of 
Quicksort.  Second,  replacing  QuickSort  with  SmoothSort  does  not  yield  any  improvement. 


8.3  Summary 

This  chapter  described  two  non-adaptive  sorting  methods  and  a  number  of  hybrid  algo¬ 
rithms.  With  the  exception  of  the  adaptive  tradeoff  considered  in  Section  8.2.4,  all  of 
the  results  discussed  in  this  chapter  apply  to  the  shuffle-exchange  as  well  as  the  hyper¬ 
cube.  Table  8.1  summarizes  the  running  times  of  the  best  known  sorting  algorithms  for  the 
shuffle-exchange  over  ascending  ranges  of  the  ratio  n/p.  For  n  <  p,  the  bounds  are  the  same 
as  for  the  hypercube  (see  Table  7.2).  For  n  =  a;(p),  the  hybrid  algorithm  of  Section  8.2.3 
provides  the  best  known  bound.  Note  that  for  n  =  ^(plog^^^p),  where  fc  is  a  fixed  positive 
integer,  the  running  time  of  the  hybrid  algorithm  is  matched  (to  within  a  constant  factor) 
by  CubeSort.  Furthermore,  it  is  matched  by  the  non-adaptive  version  of  SmoothSort  for 
n  =  ft(plog^"*"^  p),  and  by  ColumnSort  for  n  =  f)(p^"^^),  where  in  each  case  €  denotes  an 
arbitrarily  small  positive  constant. 

The  non-adaptive  version  of  SmoothSort  does  not  perform  any  explicit  selections  and 
appears  to  be  the  most  practical  sorting  method  for  sufficiently  large  values  of  n  (say,  10^ 
or  more),  and  where  n  exceeds  p  by  a  significant  polynomial  factor  (e.g.,  n  =  p^). 

Roughly  speaking,  the  construction  of  the  SquareSort  sorting  circuit  is  based  on  a  tech¬ 
nique  for  expressing  a  single  large  sort  in  terms  of  a  number  of  smaller  sorts.  If  the  size 
of  the  smaller  sorts  is  sufficiently  large,  CubeSort  performs  such  a  decomposition  with  the 
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same  asymptotic  efficiency,  although  the  multiplicative  constant  associated  with  CubeSort 
is  almost  an  order  of  magnitude  higher. 


Chapter  9 


Concluding  Remarks 


The  preceding  chapters  have  primarily  considered  algorithms  for  load  balancing,  selection 
and  sorting  on  the  hypercube  and  shuffle-exchange.  While  some  progress  has  been  made 
in  these  areas,  many  open  problems  remain.  Several  open  problems  which  correspond  to 
natural  extensions  of  the  work  described  in  this  thesis  will  now  be  considered. 

Section  4.1  presented  upper  and  lower  bounds  for  the  Balance  operation  running  on  the 
hypercube.  The  bounds  are  not  tight  in  the  case  where  the  average  number  of  tokens  per 
processor  is  less  than  a  constant  fraction  of  the  maximum  number  of  tokens  at  any  processor, 
and  it  seems  likely  that  the  upper  bound  could  be  improved  in  this  case.  One  might  also 
attempt  to  match  the  current  performance  of  Balance  with  a  hypercube  algorithm  that 
restricts  all  processors  to  communicate  along  the  same  dimension  at  any  given  time. 

It  would  be  interesting  to  try  to  prove  a  a;(logn)  lower  bound  for  the  problem  of  sorting 
n  keys  on  a  hypercube  or  shuffle-exchange  with  n  processors,  at  least  for  some  restricted 
class  of  algorithms.  For  example,  one  might  consider  non-adaptive  algorithms  or,  being  less 
restrictive,  arbitrary  algorithms  with  the  sole  restriction  that  keys  cannot  be  duplicated. 
Note  that  even  for  the  simpler  problem  of  selection,  there  is  no  known  o(log^  n)  hypercube 
algorithm  which  does  not  duplicate  keys.  With  respect  to  proving  a  lower  bound,  the 
techniques  of  Chapter  6  may  provide  a  useful  starting  point. 

All  of  the  sorting  algorithms  described  in  this  thesis  have  the  property  that  the  progress 
achieved  by  the  algorithm  at  any  given  time  is  relatively  ecisy  to  characterize.  For  example, 
the  partial  progress  of  SmoothSort  is  given  by  the  least  integer  k  such  that  every  key  has 
been  routed  to  the  correct  low-order  subcube  of  dimension  k.  On  the  other  hand,  the  partial 
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progress  of  the  O(logn)  depth  AKS  sorting  circuit  is  more  complicated  to  characterize;  in 
particular,  it  only  guarantees  progress  with  respect  to  most  of  the  keys,  as  opposed  to  aU 
of  them  (except  at  the  end,  of  course).  Thus,  it  may  be  worthwhile  to  investigate  sorting 
algorithms  which  do  not  operate  by  partitioning  the  sorting  task  into  disjoint  subproblems, 
but  instead  perform  successively  refined  approximate  sorts  over  the  entire  set  of  keys. 


Appendix  A 


Expansion  Properties  of  the 
Hypercube 


The  calculations  in  this  appendix  analyze  the  volume-to-surface  ratio  of  a  Hamming  ball 
of  radius  r  =  r(d)  lying  in  a  hypercube  of  dimension  d.  Theorem  A.  1.1,  which  is  used  in 
Section  4.1.2,  characterizes  the  asymptotic  behavior  of  this  ratio  for  r  in  the  range  0  to  c//2. 
The  results  could  easily  be  extended  to  handle  higher  values  of  r  (that  is,  d/2  <  r  <  d)  by 
taking  advantage  of  symmetry. 


A.l  Asymptotic  Analysis 


Definition  A,l,l  Let  denote  X]o</<7-  (/)/(r)* 


Lemma  A. 1.1  Let  d  and  r  be  positive  integers,  1  <  r  <  d.  Then  >  Rd,r-i- 


Proof;  Observe  that  R^^q  =  1  and  for  1  <  r  <  d 


1  — 


+  1  \0^<rVJ  / 

r  .  (d\  /  (  d  \ 

mm 


E 


> 


d  —  r  +  1  l<Kr  \l )  /  V  “  ly 

r  d  —  Z  +  1 

mm 


d  —  r  +  1  l<Kr  / 

=  1. 
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Lemma  A. 1.2  For  positive  integers  d  and  r  <  d/2, 


Rd,r  =  0 


z  +  1  r 


where  r  =  d/2  —  y/dz. 


Proof:  '  The  following  three  cases  will  be  considered  separately:  z  =  D(d)  and  .sr  <  d/4; 
2r  =  o(d)  and  z  >  1;  0  <  z  <  1. 

Case  1:  z  =  Cl{d)  and  z  <  d/4.  Exercise  (9.42)  of  Graham,  Knuth  and  Patash- 
nik  [GKP89]  establishes  that  Rd^r  =  0(1)  this  range. 

Case  2:  z  =  o{d)  and  z  >  1,  It  is  sufficient  to  prove  that  Rd^r  =  0  since 

2:  =  fi(l).  For  the  lower  bound,  consider  the  inequality 


oiiO  -  (■- i^j)’ 


and  observe  that  for  sufficiently  large  values  of  d 


.r  i 


> 


d  —  r  + 


iv^i 

[\/^\] 


\d/2  +  y/dz  +  ^/d/z  ) 


> 


dl2-2y/Tz'Y^^ 

d/2  +  2Vdz  J 


-  (‘ 

which  converges  to  =  (2(1).  Hence,  Rd,r  =  ^{\/dfz). 
For  the  upper  bound,  note  that 

f  d  \  /(d 


J-1. 


I 


< 


d  -  r  +  I' 
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for  1  <  /  <  r.  Hence,  the  sum  l^o</<r  (1)  dominated  by  the  infinite  geometric  progression 
with  initial  value  (^)  and  ratio  r/(d  -  r  +  1)  between  successive  terms.  Thus, 


■Rd.r  < 


d  -  r  +  1 


d  —  2r  +  1 
=  0 


Case  3:  0  <  2:  <  1.  It  is  sufficient  to  prove  that  Iicl,r  =  ©  ('^)  z  =  0(1).  A 

lower  bound  on  can  be  obtained  as  follows: 


Furthermore,  for  sufficiently  large  values  of  d 


~  [d/2A2^/dJ 


which  converges  to  e  ®  =  12(1).  Hence,  Rd,r  =  ^(Vd). 

For  the  upper  bound,  Stirling’s  approximation  can  be  used  to  show  that  Rd,id/2}  = 
Q(-\/d).  It  follows  from  Lemma  A.  1.1  that  R^^r  —  0(^/d).  [] 


Theorem  A.1.1  Let  d  be  a  given  integer  and  let  r  =  r(d)  be  an  integer  between  0  and 
d/2.  If  Eo<Kr  (f)  =  then  Rd,r  =  Q(y/k). 

Proof:  The  following  four  cases  will  be  considered  separately;  r  —  d/2  —  {1(d)  and  r  >  0; 
r  =  cl/2  -  o(d)  and  r  <  d/2  -  d/2  —  d?!^  <  r  <  d/2  -  y/d;  d/2  -  \/d  <  r  <  d/2. 

Case  1:  r  =  d/2  -  {1(d)  and  r  >  0.  In  this  range,  Rd,r  —  ©(1)  by  Lemma  A. 1.2,  and 

k  =  0(1)  by  Exercise  (9.42)  of  [GKP89].  Hence,  Rd,r  =  ©(V^). 

Case  2:  r  =  d/2  —  o(d)  and  r  <  d/2  —  .  Let  r  =  d/2  —  hence  i  = 

1/2  —  t<;(l/ log  d)  and  b  >  1/6.  As  in  the  preceding  case,  Rd^r  —  C(ddl^~^)  by  Lemma  A. 1.2. 
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The  logarithmic  form  of  Stirling’s  approximation  implies  that  Inn!  =  nlnn  —  n  +  0(ln  n). 
Hence, 

In  =  dlnd -  rlnr  —  (c?— r’)ln(<f  —  r)  +  O(lnd) 

=  dlnd-  (d/2  -  d^+^/^)ln(d/2  - 

-(d/2  +  d^+^/2)  +  d^+^/2)  ^ 

=  dln2-(d/2-d^+^/2)ln(l-2d^-^/2) 

-(d/2  +  d^+^/2)ln(l  +  2d^-^/^)  +  O(lnd). 


The  following  pair  of  inequalities  may  be  easily  derived  from  the  Taylor’s  series  expansion 
of  ln(l  +  a;): 

X  —  x^  12  <  ln(l  +  x)  <  I,  X  >  0,  and 
—X  —  x^  <  ln(l  —  x)  <  —X  —  x^/2,  0  <  X  <  i. 

These  inequalities  imply  that  for  suflBciently  large  values  of  d, 

din 2  -  (d/2  -  d^+^/2)(-2d^-i/2  _  2d2S-i) 

-(d/2  4-  d^+^/2)('2/-i/2)  ^ 
din  2  -  Sd^^  -  2d®^“^/2  ^ 
din  2  —  5d^'^  +  0(ln  d), 


and 


<  dln2  -  (d/2  -  d^+^/2)(-2d^"^/2  - 

-(d/2  +  d^+^/2)(2d^“^/2  -  2d2^-^)  +  O(lnd) 
=  din 2  -  d^^  -  2d^^"^/2  +  0(ln  d) 

<  dln2-d2^  +  0(lnd). 


Thus,  Xro</<r  (/)  =  Rd,rif)  =  2*^  where  the  O(lnd)  term  has  been  absorbed  into  the 

Q{<P^)  term  (using  the  fact  that  d  >  1/6).  Hence,  k  =  0(d^“^^)  and  Rd^r  =  0(V^). 

Case  3:  d/2  —  d?/^  <  r  <  d/2  —  Vd.  Let  r  =  d/2  —  d^/^’*"'^,  0  <  ^  <  1/6.  In  this  case, 
Rd^r  =  0(d^/^~^)  by  Lemma  A. 1.2.  Equation  (9.98)  of  [GKP89]  implies  that 


=  0 


^  -2<P‘ 

y/d 
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so  multiplying  by  Rd,r  gives 

for  k  =  rfln2/(^lncl  +  2<f^^).  Observe  that  k  =  since  =  fi(^lnd),  5  >  0.  Hence 

Rd,r  =  Q(y/k). 

Case  4:  d/2  —  \fd  <  r  <  d/2.  In  this  case,  =  Q{Vd)  by  Lemma  A.l. 2.  Together 
with  Equation  (9.98)  of  [GKP89],  this  implies  that  X2o</<r  (f)  =  0(2'^)-  Hence,  k  =  0(d) 
and  Rd,r  =  Q{Vk).  0 
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